2025-05-07T20:22:35.2725960Z Current runner version: '2.323.0'
2025-05-07T20:22:35.2732574Z Runner name: 'i-0c2643f2bcfaf5e6b'
2025-05-07T20:22:35.2733572Z Machine name: 'ip-10-0-1-116'
2025-05-07T20:22:35.2736373Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:35.2738729Z Contents: read
2025-05-07T20:22:35.2739245Z Metadata: read
2025-05-07T20:22:35.2739730Z Packages: read
2025-05-07T20:22:35.2740327Z ##[endgroup]
2025-05-07T20:22:35.2742682Z Secret source: None
2025-05-07T20:22:35.2743741Z Prepare workflow directory
2025-05-07T20:22:35.3263071Z Prepare all required actions
2025-05-07T20:22:35.3299673Z Getting action download info
2025-05-07T20:22:35.5620465Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.8397091Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:36.2004470Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.8107259Z Getting action download info
2025-05-07T20:22:37.9011455Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:38.1343941Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.10, 12.6.3, 12.6.3, gcc)
2025-05-07T20:22:38.1960699Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:38.2095422Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:38.2108451Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:38.2109983Z ##[endgroup]
2025-05-07T20:22:39.3939490Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.3940082Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.3940342Z AMI Name: unknown
2025-05-07T20:22:39.3981344Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.7477913Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.7478222Z with:
2025-05-07T20:22:44.7478449Z submodules: true
2025-05-07T20:22:44.7478687Z repository: pytorch/FBGEMM
2025-05-07T20:22:44.7479090Z token: ***
2025-05-07T20:22:44.7479292Z ssh-strict: true
2025-05-07T20:22:44.7479503Z ssh-user: git
2025-05-07T20:22:44.7479727Z persist-credentials: true
2025-05-07T20:22:44.7479975Z clean: true
2025-05-07T20:22:44.7480201Z sparse-checkout-cone-mode: true
2025-05-07T20:22:44.7480471Z fetch-depth: 1
2025-05-07T20:22:44.7480686Z fetch-tags: false
2025-05-07T20:22:44.7480899Z show-progress: true
2025-05-07T20:22:44.7481122Z lfs: false
2025-05-07T20:22:44.7481324Z set-safe-directory: true
2025-05-07T20:22:44.7481575Z env:
2025-05-07T20:22:44.7481781Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.7482087Z BUILD_ENV: build_binary
2025-05-07T20:22:44.7482367Z BUILD_TARGET: genai
2025-05-07T20:22:44.7482587Z BUILD_VARIANT: cuda
2025-05-07T20:22:44.7482851Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.7483101Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.7483339Z ##[endgroup]
2025-05-07T20:22:44.8636749Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.8637939Z ##[group]Getting Git version info
2025-05-07T20:22:44.8638378Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8638988Z [command]/usr/bin/git version
2025-05-07T20:22:44.8639250Z git version 2.47.1
2025-05-07T20:22:44.8645277Z ##[endgroup]
2025-05-07T20:22:44.8668146Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/af869ebb-95fa-41ed-9d48-4e5f3a9a72b2' before making global git config changes
2025-05-07T20:22:44.8669056Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.8673066Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8710264Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8713003Z ##[group]Initializing the repository
2025-05-07T20:22:44.8717125Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8760266Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.8760913Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.8761449Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.8761824Z hint:
2025-05-07T20:22:44.8762112Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:44.8762442Z hint:
2025-05-07T20:22:44.8762763Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.8763307Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.8763715Z hint:
2025-05-07T20:22:44.8763942Z hint:   git branch -m <name>
2025-05-07T20:22:44.8764460Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.8774338Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.8809022Z ##[endgroup]
2025-05-07T20:22:44.8809737Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.8813772Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.8845843Z ##[endgroup]
2025-05-07T20:22:44.8846439Z ##[group]Setting up auth
2025-05-07T20:22:44.8853271Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.8886044Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.9251909Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.9284795Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.9635830Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.9686944Z ##[endgroup]
2025-05-07T20:22:44.9687622Z ##[group]Fetching the repository
2025-05-07T20:22:44.9696505Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3163001Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3163529Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3188068Z ##[endgroup]
2025-05-07T20:22:45.3188457Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3191321Z ##[endgroup]
2025-05-07T20:22:45.3195873Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.3230386Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
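The fetch above never embeds credentials in the remote URL: checkout@v4 injects a basic-auth header through http.<url>.extraheader and pulls the PR's synthetic merge ref directly. A minimal sketch of the same pattern outside Actions, assuming a GITHUB_TOKEN variable in the environment (the basic credential is "x-access-token:<token>"):

  # Sketch: replicate checkout@v4's authenticated shallow fetch of a PR merge ref.
  # GITHUB_TOKEN is an assumption here, not something this log provides.
  AUTH=$(printf 'x-access-token:%s' "$GITHUB_TOKEN" | base64 | tr -d '\n')
  git init FBGEMM && cd FBGEMM
  git remote add origin https://github.com/pytorch/FBGEMM
  git -c http.https://github.com/.extraheader="AUTHORIZATION: basic $AUTH" \
      -c protocol.version=2 \
      fetch --no-tags --prune --no-recurse-submodules --depth=1 \
      origin +refs/pull/4066/merge:refs/remotes/pull/4066/merge
  git checkout --progress --force refs/remotes/pull/4066/merge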
2025-05-07T20:22:45.3257718Z ##[group]Checking out the ref
2025-05-07T20:22:45.3261709Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.4354720Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.4355048Z
2025-05-07T20:22:45.4355297Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.4355927Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.4356435Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.4356742Z
2025-05-07T20:22:45.4356954Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.4357432Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.4357701Z
2025-05-07T20:22:45.4357823Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.4358011Z
2025-05-07T20:22:45.4358139Z Or undo this operation with:
2025-05-07T20:22:45.4358313Z
2025-05-07T20:22:45.4358408Z   git switch -
2025-05-07T20:22:45.4358885Z
2025-05-07T20:22:45.4359114Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.4359444Z
2025-05-07T20:22:45.4359824Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.4368449Z ##[endgroup]
2025-05-07T20:22:45.4368852Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.4373874Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.4423474Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.4454945Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.4486488Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.4515706Z ##[endgroup]
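The two insteadOf rules just added rewrite SSH-style remotes to HTTPS, so the extraheader token also covers any submodule pinned to a git@github.com URL. A standalone sketch of the mechanism:

  # Sketch: route SSH-style GitHub URLs over HTTPS so header-based auth applies.
  git config --global --add url.https://github.com/.insteadOf git@github.com:
  # After this, a submodule remote like git@github.com:asmjit/asmjit.git is
  # fetched as https://github.com/asmjit/asmjit.git, picking up the
  # http.https://github.com/.extraheader credential configured above.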
2025-05-07T20:22:45.4516094Z ##[group]Fetching submodules
2025-05-07T20:22:45.4518424Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.4862760Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.5193567Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.5195757Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.5198160Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.5201475Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.5204968Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.5208725Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.5211933Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.5242743Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.8854804Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.3693432Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.8146500Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:47.9688167Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.2280562Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.5195539Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.7025151Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.7025755Z  * branch            e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.7497751Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.3724727Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.3725205Z  * branch            4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:50.6530431Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.2691514Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.2692001Z  * branch            6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.3693447Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.4916542Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.4917061Z  * branch            3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.1909475Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.1331104Z From https://github.com/google/googletest
2025-05-07T20:22:54.1331562Z  * branch            f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.1739790Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:54.8851652Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:54.8852592Z  * branch            420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:54.8935285Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.6040030Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.6040693Z  * branch            9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:55.7145147Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:55.7165089Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:55.7500114Z Entering 'external/asmjit'
2025-05-07T20:22:55.7532544Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.7564394Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.7596828Z Entering 'external/cutlass'
2025-05-07T20:22:55.7628677Z Entering 'external/googletest'
2025-05-07T20:22:55.7660216Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.7692639Z Entering 'external/json'
2025-05-07T20:22:55.7735747Z ##[endgroup]
2025-05-07T20:22:55.7736137Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:55.7742760Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:55.8073798Z Entering 'external/asmjit'
2025-05-07T20:22:55.8140079Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8216235Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8282840Z Entering 'external/cutlass'
2025-05-07T20:22:55.8362557Z Entering 'external/googletest'
2025-05-07T20:22:55.8429984Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8495897Z Entering 'external/json'
2025-05-07T20:22:55.8580406Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:55.8906604Z Entering 'external/asmjit'
2025-05-07T20:22:55.8968105Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:55.8970602Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.9031594Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:55.9034547Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.9095695Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:55.9098803Z Entering 'external/cutlass'
2025-05-07T20:22:55.9159736Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:55.9162740Z Entering 'external/googletest'
2025-05-07T20:22:55.9224430Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:55.9227428Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9288835Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:55.9292757Z Entering 'external/json'
2025-05-07T20:22:55.9354909Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:55.9456890Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:55.9786151Z Entering 'external/asmjit'
2025-05-07T20:22:55.9819925Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.9853067Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.9885839Z Entering 'external/cutlass'
2025-05-07T20:22:55.9917667Z Entering 'external/googletest'
2025-05-07T20:22:55.9949980Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9982562Z Entering 'external/json'
2025-05-07T20:22:56.0031758Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.0350938Z Entering 'external/asmjit'
2025-05-07T20:22:56.0382459Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.0413780Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.0445302Z Entering 'external/cutlass'
2025-05-07T20:22:56.0478672Z Entering 'external/googletest'
2025-05-07T20:22:56.0510270Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.0541223Z Entering 'external/json'
2025-05-07T20:22:56.0584241Z ##[endgroup]
2025-05-07T20:22:56.0645155Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.0655244Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
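The repeated foreach invocations above rely on a small idiom: `--get-regexp` exits non-zero when the key is absent, and `git submodule foreach` aborts on the first non-zero exit, so the trailing `|| :` makes "unset this key if present" safe to run in every submodule. Isolated as a sketch:

  # Sketch: idempotently clear a config key across all submodules.
  # --get-regexp fails when the key is missing; `|| :` keeps foreach going.
  git submodule foreach --recursive sh -c \
    "git config --local --name-only --get-regexp 'core\.sshCommand' \
     && git config --local --unset-all 'core.sshCommand' || :"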
2025-05-07T20:22:56.0837094Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.0837391Z with:
2025-05-07T20:22:56.0837625Z name: fbgemm_genai_x86_gcc_py3.10_cu12.6.3.whl
2025-05-07T20:22:56.0837933Z merge-multiple: false
2025-05-07T20:22:56.0838178Z repository: pytorch/FBGEMM
2025-05-07T20:22:56.0838417Z run-id: 14891846252
2025-05-07T20:22:56.0838629Z env:
2025-05-07T20:22:56.0838851Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.0839132Z BUILD_ENV: build_binary
2025-05-07T20:22:56.0839369Z BUILD_TARGET: genai
2025-05-07T20:22:56.0839580Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.0839807Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.0840043Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.0840267Z ##[endgroup]
2025-05-07T20:22:56.3145825Z Downloading single artifact
2025-05-07T20:22:56.4118796Z Preparing to download the following artifacts:
2025-05-07T20:22:56.4119721Z - fbgemm_genai_x86_gcc_py3.10_cu12.6.3.whl (ID: 3081361682, Size: 12507040, Expected Digest: sha256:54786970e5b7d46c26833313b7eb27e7a268d8dcd818a1c2bdaca6edadbd9a0b)
2025-05-07T20:22:56.4878966Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-1bd3b0b6-3733-53f4-b996-74ebe9e5efe1/artifacts/1754a5081fdead90bf158dc66d782ebfca5c7dcf5e2261bf900fbf3d44fedad1.zip
2025-05-07T20:22:56.4880365Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.5836794Z (node:57041) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.5837735Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:56.8406346Z SHA256 digest of downloaded artifact is 54786970e5b7d46c26833313b7eb27e7a268d8dcd818a1c2bdaca6edadbd9a0b
2025-05-07T20:22:56.8406947Z Artifact download completed successfully.
2025-05-07T20:22:56.8407275Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:56.8412599Z Download artifact has finished successfully
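download-artifact@v4 compares the artifact's recorded digest (the Expected Digest above) against the bytes it fetched before declaring success. A hand-rolled equivalent, assuming the artifact archive has already been saved as artifact.zip (a hypothetical local name):

  # Sketch: verify a downloaded artifact against its expected SHA256 digest.
  EXPECTED=54786970e5b7d46c26833313b7eb27e7a268d8dcd818a1c2bdaca6edadbd9a0b
  ACTUAL=$(sha256sum artifact.zip | awk '{print $1}')
  if [ "$ACTUAL" != "$EXPECTED" ]; then
    echo "digest mismatch: got $ACTUAL" >&2
    exit 1
  fi
  echo "SHA256 digest of downloaded artifact is $ACTUAL"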
2025-05-07T20:22:56.8674323Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:56.8674708Z with:
2025-05-07T20:22:56.8674918Z driver-version: 570.133.07
2025-05-07T20:22:56.8675158Z env:
2025-05-07T20:22:56.8675368Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.8675660Z BUILD_ENV: build_binary
2025-05-07T20:22:56.8675899Z BUILD_TARGET: genai
2025-05-07T20:22:56.8676112Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.8676343Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.8676594Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.8676822Z ##[endgroup]
2025-05-07T20:22:56.8770566Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:56.8770938Z with:
2025-05-07T20:22:56.8771323Z timeout_minutes: 10
2025-05-07T20:22:56.8771556Z max_attempts: 3
2025-05-07T20:22:56.8795168Z command:
  # Is it disgusting to have a full shell script here in this github action? Sure
  # But is it the best way to make it so that this action relies on nothing else? Absolutely
  set -eou pipefail

  DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
  DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

  install_nvidia_docker2_amzn2() {
    (
      set -x
      # Needed for yum-config-manager
      sudo yum install -y yum-utils
      if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
        YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
      else
        # Amazon Linux 2
        YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
      fi
      sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
      sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
      sudo systemctl restart docker
    )
  }

  install_nvidia_docker2_ubuntu20() {
    (
      set -x
      # Install nvidia-driver package if not installed
      status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
      if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
        sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      fi
    )
  }

  pre_install_nvidia_driver_amzn2() {
    (
      # Purge any nvidia driver installed from RHEL repo
      sudo yum remove -y nvidia-driver-latest-dkms
    )
  }

  install_nvidia_driver_common() {
    (
      # Try to gather more information about the runner and its existing NVIDIA driver if any
      echo "Before installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true

      HAS_NVIDIA_DRIVER=0
      # Check if NVIDIA driver has already been installed
      if [ -x "$(command -v nvidia-smi)" ]; then
        set +e
        # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
        # so that the same driver version is not print over multiple lines
        INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        NVIDIA_SMI_STATUS=$?
        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
          echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
        elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
          # Turn off persistent mode so that the installation script can unload the kernel module
          sudo killall nvidia-persistenced || true
        else
          HAS_NVIDIA_DRIVER=1
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
        fi
        set -e
      fi

      if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
        # CAUTION: this may need to be updated in future
        if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
          sudo yum groupinstall -y "Development Tools"
          # ensure our kernel install is the same as our underlying kernel,
          # groupinstall "Development Tools" has a habit of mismatching kernel headers
          sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
          sudo modprobe backlight
        fi
        sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

        set +e
        sudo /bin/bash /tmp/nvidia_driver -s --no-drm
        NVIDIA_INSTALLATION_STATUS=$?

        RESET_GPU=0
        if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
          sudo cat /var/log/nvidia-installer.log
          # Fail to install NVIDIA driver, try to reset the GPU
          RESET_GPU=1
        elif [ -x "$(command -v nvidia-smi)" ]; then
          # Check again if nvidia-smi works even if the driver installation completes successfully
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            RESET_GPU=1
          fi
        fi

        if [ "$RESET_GPU" -eq 1 ]; then
          NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
          # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this
          # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
          for PCI_ID in $NVIDIA_DEVICES; do
            DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
            echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)"
            # This requires sudo permission of course
            echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
            sleep 1
          done
        fi

        sudo rm -fv /tmp/nvidia_driver
        set -e
      fi
    )
  }

  post_install_nvidia_driver_common() {
    (
      sudo modprobe nvidia || true
      echo "After installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true

      (
        set +e
        nvidia-smi
        # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
        # the case where the driver has already crashed as it still can get the driver version
        # and some basic information like the bus ID. However, the rest of the information
        # would be missing (ERR!), for example:
        #
        # +-----------------------------------------------------------------------------+
        # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
        # |-------------------------------+----------------------+----------------------+
        # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
        # |                               |                      |               MIG M. |
        # |===============================+======================+======================|
        # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
        # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
        # |                               |                      |                 ERR! |
        # +-------------------------------+----------------------+----------------------+
        #
        # +-----------------------------------------------------------------------------+
        # | Processes:                                                                  |
        # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
        # |        ID   ID                                                   Usage      |
        # |=============================================================================|
        # +-----------------------------------------------------------------------------+
        #
        # This should be reported as a failure instead as it will guarantee to fail when
        # Docker tries to run with --gpus all
        #
        # So, the correct check here is to query one of the missing piece of info like
        # GPU name, so that the command can fail accordingly
        nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
        NVIDIA_SMI_STATUS=$?

        # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
        if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
          echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
        else
          echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
          exit ${NVIDIA_SMI_STATUS}
        fi
        set -e
      )
    )
  }

  install_nvidia_driver_amzn2() {
    (
      set -x
      pre_install_nvidia_driver_amzn2
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  install_nvidia_driver_ubuntu20() {
    (
      set -x
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  echo "== Installing nvidia driver ${DRIVER_FN} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_driver_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_driver_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  # Install container toolkit based on distribution
  echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_docker2_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_docker2_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

  # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
  # more than one GPUs. This just needs to be run once. The command fails
  # on subsequent runs and complains that the mode is already on, but that's
  # ok
  sudo nvidia-persistenced || true
  # This should show persistence mode ON
  nvidia-smi
2025-05-07T20:22:56.8818495Z retry_wait_seconds: 10
2025-05-07T20:22:56.8818757Z polling_interval_seconds: 1
2025-05-07T20:22:56.8819010Z warning_on_retry: true
2025-05-07T20:22:56.8819251Z continue_on_error: false
2025-05-07T20:22:56.8819486Z env:
2025-05-07T20:22:56.8819708Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.8820091Z BUILD_ENV: build_binary
2025-05-07T20:22:56.8820332Z BUILD_TARGET: genai
2025-05-07T20:22:56.8820555Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.8820798Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.8821052Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.8821292Z DRIVER_VERSION: 570.133.07
2025-05-07T20:22:56.8821534Z ##[endgroup]
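nick-fields/retry wraps the script above with timeout_minutes: 10, max_attempts: 3, and retry_wait_seconds: 10. A rough plain-bash sketch of those semantics, with setup_nvidia.sh standing in for the command body (a hypothetical file name):

  # Sketch: retry loop approximating the action's settings above.
  for attempt in 1 2 3; do
    if timeout 10m bash setup_nvidia.sh; then
      echo "Command completed after ${attempt} attempt(s)."
      break
    fi
    if [ "$attempt" -eq 3 ]; then
      echo "Final attempt failed" >&2
      exit 1
    fi
    echo "Attempt ${attempt} failed. Retrying in 10 seconds..." >&2
    sleep 10
  done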
2025-05-07T20:22:56.9628677Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:56.9629568Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:56.9632413Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.5030352Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.5030789Z No packages marked for removal.
2025-05-07T20:22:57.5094541Z Dependencies resolved.
2025-05-07T20:22:57.5104150Z Nothing to do.
2025-05-07T20:22:57.5104401Z Complete!
2025-05-07T20:22:57.5436092Z + install_nvidia_driver_common
2025-05-07T20:22:57.5441601Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.5441896Z + lspci
2025-05-07T20:22:57.5443581Z Before installing NVIDIA driver
2025-05-07T20:22:57.5625385Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.5629068Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.5642469Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.5643325Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.5644103Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.5644953Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.5645741Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.5646544Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.5647198Z + lsmod
2025-05-07T20:22:57.5673782Z Module                  Size  Used by
2025-05-07T20:22:57.5674306Z xt_conntrack           16384  1
2025-05-07T20:22:57.5674735Z nft_chain_nat          16384  3
2025-05-07T20:22:57.5675152Z xt_MASQUERADE          20480  1
2025-05-07T20:22:57.5675647Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.5676150Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:57.5676817Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.5677558Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:57.5678072Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:57.5678519Z xfrm_user              57344  1
2025-05-07T20:22:57.5678953Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:57.5679410Z xt_addrtype            16384  2
2025-05-07T20:22:57.5679812Z nft_compat             20480  4
2025-05-07T20:22:57.5680258Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.5680899Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.5681504Z br_netfilter           36864  0
2025-05-07T20:22:57.5681954Z bridge                323584  1 br_netfilter
2025-05-07T20:22:57.5682421Z stp                    16384  1 bridge
2025-05-07T20:22:57.5682859Z llc                    16384  2 bridge,stp
2025-05-07T20:22:57.5683283Z overlay               167936  0
2025-05-07T20:22:57.5683663Z tls                   135168  0
2025-05-07T20:22:57.5684046Z nls_ascii              16384  1
2025-05-07T20:22:57.5684458Z nls_cp437              20480  1
2025-05-07T20:22:57.5684867Z vfat                   24576  1
2025-05-07T20:22:57.5685255Z fat                    86016  1 vfat
2025-05-07T20:22:57.5685681Z sunrpc                696320  1
2025-05-07T20:22:57.5686070Z ena                   180224  0
2025-05-07T20:22:57.5686452Z i8042                  45056  0
2025-05-07T20:22:57.5686858Z serio                  28672  3 i8042
2025-05-07T20:22:57.5687299Z button                 24576  0
2025-05-07T20:22:57.5687726Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:57.5688169Z dm_mod                188416  0
2025-05-07T20:22:57.5688583Z sch_fq_codel           20480  17
2025-05-07T20:22:57.5689001Z fuse                  163840  1
2025-05-07T20:22:57.5689392Z loop                   36864  0
2025-05-07T20:22:57.5690134Z configfs               57344  1
2025-05-07T20:22:57.5690576Z dax                    45056  1 dm_mod
2025-05-07T20:22:57.5691010Z dmi_sysfs              20480  0
2025-05-07T20:22:57.5691428Z crc32_pclmul           16384  0
2025-05-07T20:22:57.5691852Z crc32c_intel           24576  0
2025-05-07T20:22:57.5692273Z efivarfs               24576  1
2025-05-07T20:22:57.5692731Z + modinfo nvidia
2025-05-07T20:22:57.5693350Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.5694079Z import_ns:      DMA_BUF
2025-05-07T20:22:57.5694464Z alias:          char-major-195-*
2025-05-07T20:22:57.5694891Z version:        570.133.07
2025-05-07T20:22:57.5695299Z supported:      external
2025-05-07T20:22:57.5695686Z license:        Dual MIT/GPL
2025-05-07T20:22:57.5696160Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.5696677Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.5697510Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:57.5698039Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.5698641Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.5699161Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.5699643Z depends:        i2c-core,drm
2025-05-07T20:22:57.5700175Z retpoline:      Y
2025-05-07T20:22:57.5700517Z name:           nvidia
2025-05-07T20:22:57.5701078Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.5701823Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.5702549Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.5703476Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.5704071Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:57.5704583Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.5705069Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:57.5705572Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:57.5706245Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:57.5706812Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.5707368Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.5707707Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.5708010Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:57.5708310Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.5708670Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.5709059Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.5709425Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.5709848Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.5710253Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.5710670Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.5711090Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.5711429Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.5711798Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.5712160Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.5712506Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.5712827Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.5713179Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.5713504Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.5713813Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:57.5714149Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.5714511Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.5714838Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:57.5715189Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.5715528Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.5715876Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:57.5716215Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.5716536Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:57.5716830Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.5717154Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.5717471Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.5717781Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.5718108Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.5718457Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.5718792Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:57.5719118Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.5719462Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.5719785Z parm:           rm_firmware_active:charp
2025-05-07T20:22:57.5720221Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.5720469Z ++ command -v nvidia-smi
2025-05-07T20:22:57.5720722Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.5720979Z + set +e
2025-05-07T20:22:57.5721286Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.3673819Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.3674156Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.3674710Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.3674936Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.3675208Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.3675647Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.3676110Z + set -e
2025-05-07T20:22:59.3677132Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.3677528Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.3677999Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.3680994Z + sudo modprobe nvidia
2025-05-07T20:22:59.5183454Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.5183879Z + lspci
2025-05-07T20:22:59.5184176Z After installing NVIDIA driver
2025-05-07T20:22:59.5300476Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.5301173Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.5301819Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.5302335Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.5302816Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.5303537Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.5304226Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.5304707Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.5305112Z + lsmod
2025-05-07T20:22:59.5331939Z Module                  Size  Used by
2025-05-07T20:22:59.5332372Z nvidia_uvm           1884160  0
2025-05-07T20:22:59.5332784Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:22:59.5333178Z drm                   602112  1 nvidia
2025-05-07T20:22:59.5333576Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:22:59.5333890Z backlight              24576  1 drm
2025-05-07T20:22:59.5334181Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:22:59.5334580Z xt_conntrack           16384  1
2025-05-07T20:22:59.5334948Z nft_chain_nat          16384  3
2025-05-07T20:22:59.5335312Z xt_MASQUERADE          20480  1
2025-05-07T20:22:59.5335629Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.5335991Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:59.5336388Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.5336827Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:59.5337142Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:59.5337434Z xfrm_user              57344  1
2025-05-07T20:22:59.5337703Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:59.5337987Z xt_addrtype            16384  2
2025-05-07T20:22:59.5338264Z nft_compat             20480  4
2025-05-07T20:22:59.5338562Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.5338971Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.5339346Z br_netfilter           36864  0
2025-05-07T20:22:59.5339617Z bridge                323584  1 br_netfilter
2025-05-07T20:22:59.5340026Z stp                    16384  1 bridge
2025-05-07T20:22:59.5340313Z llc                    16384  2 bridge,stp
2025-05-07T20:22:59.5340594Z overlay               167936  0
2025-05-07T20:22:59.5340852Z tls                   135168  0
2025-05-07T20:22:59.5341103Z nls_ascii              16384  1
2025-05-07T20:22:59.5341621Z nls_cp437              20480  1
2025-05-07T20:22:59.5341872Z vfat                   24576  1
2025-05-07T20:22:59.5342126Z fat                    86016  1 vfat
2025-05-07T20:22:59.5342391Z sunrpc                696320  1
2025-05-07T20:22:59.5342631Z ena                   180224  0
2025-05-07T20:22:59.5342876Z i8042                  45056  0
2025-05-07T20:22:59.5343128Z serio                  28672  3 i8042
2025-05-07T20:22:59.5343391Z button                 24576  0
2025-05-07T20:22:59.5343647Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:59.5343905Z dm_mod                188416  0
2025-05-07T20:22:59.5344153Z sch_fq_codel           20480  17
2025-05-07T20:22:59.5344411Z fuse                  163840  1
2025-05-07T20:22:59.5344657Z loop                   36864  0
2025-05-07T20:22:59.5345078Z configfs               57344  1
2025-05-07T20:22:59.5345335Z dax                    45056  1 dm_mod
2025-05-07T20:22:59.5345609Z dmi_sysfs              20480  0
2025-05-07T20:22:59.5345860Z crc32_pclmul           16384  0
2025-05-07T20:22:59.5346118Z crc32c_intel           24576  0
2025-05-07T20:22:59.5346372Z efivarfs               24576  1
2025-05-07T20:22:59.5346620Z + modinfo nvidia
2025-05-07T20:22:59.5349581Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.5350241Z import_ns:      DMA_BUF
2025-05-07T20:22:59.5350595Z alias:          char-major-195-*
2025-05-07T20:22:59.5350959Z version:        570.133.07
2025-05-07T20:22:59.5351297Z supported:      external
2025-05-07T20:22:59.5351556Z license:        Dual MIT/GPL
2025-05-07T20:22:59.5351841Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.5352170Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.5352490Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:59.5352815Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.5353144Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.5353478Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.5353789Z depends:        i2c-core,drm
2025-05-07T20:22:59.5354050Z retpoline:      Y
2025-05-07T20:22:59.5354358Z name:           nvidia
2025-05-07T20:22:59.5354844Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.5355434Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.5355868Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.5356277Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.5356584Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:59.5356877Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.5357188Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:59.5357485Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:59.5357784Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:59.5358140Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.5358528Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.5358857Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.5359197Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:59.5359500Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.5359856Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.5360243Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.5360618Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.5361023Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.5361423Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.5361837Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.5362237Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.5362575Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.5362933Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.5363428Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.5363805Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.5364151Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.5364516Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.5364872Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.5365209Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:59.5365594Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.5365995Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.5366349Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:59.5366739Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.5367082Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.5367506Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:59.5367835Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.5368161Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:59.5368450Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.5368763Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.5369084Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.5369398Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.5369722Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.5370065Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.5370405Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:59.5370733Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.5371070Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.5371404Z parm:           rm_firmware_active:charp
2025-05-07T20:22:59.5371677Z + set +e
2025-05-07T20:22:59.5371869Z + nvidia-smi
2025-05-07T20:23:00.9249800Z Wed May 7 20:23:00 2025
2025-05-07T20:23:00.9250192Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9250702Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8      |
2025-05-07T20:23:00.9251180Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9251654Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:00.9252172Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:00.9252594Z |                                         |                        |               MIG M. |
2025-05-07T20:23:00.9252927Z |=========================================+========================+======================|
2025-05-07T20:23:00.9313630Z |   0  NVIDIA A10G                   Off  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:00.9314091Z |  0%   31C    P0             63W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:00.9314473Z |                                         |                        |                  N/A |
2025-05-07T20:23:00.9314864Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9315267Z
2025-05-07T20:23:00.9315657Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9316075Z | Processes:                                                                              |
2025-05-07T20:23:00.9316507Z |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
2025-05-07T20:23:00.9316906Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:00.9317249Z |=========================================================================================|
2025-05-07T20:23:00.9318840Z |  No running processes found                                                             |
2025-05-07T20:23:00.9320116Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.3508248Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:02.7367488Z NVIDIA A10G
2025-05-07T20:23:03.0051507Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.0051897Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.0052169Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.0052453Z + set -e
2025-05-07T20:23:03.0052666Z INFO: Ignoring allowed status 0
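The check that just ran treats nvidia-smi exit codes 0 and 14 as healthy; anything else is surfaced as a failure (see the gpu-operator issue linked in the script for the rationale behind 14). The decision point, isolated as a sketch:

  # Sketch: accept only nvidia-smi exit statuses 0 and 14, fail on anything else.
  nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
  status=$?
  case "$status" in
    0|14) echo "INFO: Ignoring allowed status ${status}" ;;
    *)    echo "ERROR: nvidia-smi exited with unresolved status ${status}" >&2
          exit "$status" ;;
  esac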
2025-05-07T20:23:03.0061119Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.0064802Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.4237641Z Last metadata expiration check: 0:05:51 ago on Wed May 7 20:17:12 2025.
2025-05-07T20:23:03.4482681Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.4878138Z Dependencies resolved.
2025-05-07T20:23:03.5060270Z Nothing to do.
2025-05-07T20:23:03.5060620Z Complete!
2025-05-07T20:23:03.5440794Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.5441612Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.5442589Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.8672226Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.9243714Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.4486912Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:04.4734485Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.5137764Z Dependencies resolved.
2025-05-07T20:23:04.5318079Z ================================================================================
2025-05-07T20:23:04.5318498Z  Package                        Arch    Version   Repository               Size
2025-05-07T20:23:04.5318899Z ================================================================================
2025-05-07T20:23:04.5319203Z Downgrading:
2025-05-07T20:23:04.5319559Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.5320139Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.5320493Z
2025-05-07T20:23:04.5320588Z Transaction Summary
2025-05-07T20:23:04.5320835Z ================================================================================
2025-05-07T20:23:04.5321134Z Downgrade  2 Packages
2025-05-07T20:23:04.5321286Z
2025-05-07T20:23:04.5321404Z Total download size: 6.8 M
2025-05-07T20:23:04.5322718Z Downloading Packages:
2025-05-07T20:23:04.5818539Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  26 MB/s | 1.2 MB     00:00
2025-05-07T20:23:04.6212396Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  64 MB/s | 5.6 MB     00:00
2025-05-07T20:23:04.6221406Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.6224791Z Total                                            76 MB/s | 6.8 MB     00:00
2025-05-07T20:23:04.6227756Z Running transaction check
2025-05-07T20:23:04.6332050Z Transaction check succeeded.
2025-05-07T20:23:04.6332688Z Running transaction test
2025-05-07T20:23:04.6626386Z Transaction test succeeded.
2025-05-07T20:23:04.6630242Z Running transaction
2025-05-07T20:23:05.2080547Z   Preparing        :                                                    1/1
2025-05-07T20:23:05.3130864Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64      1/4
2025-05-07T20:23:05.3154954Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64           2/4
2025-05-07T20:23:05.3371580Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64           2/4
2025-05-07T20:23:05.3372351Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64           3/4
2025-05-07T20:23:05.3475354Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64           3/4
2025-05-07T20:23:05.3498422Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64      4/4
2025-05-07T20:23:06.7631672Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64           4/4
2025-05-07T20:23:06.7632479Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64           1/4
2025-05-07T20:23:06.7633212Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64           2/4
2025-05-07T20:23:06.7633845Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64      3/4
2025-05-07T20:23:06.8987044Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64      4/4
================================================================================
2025-05-07T20:23:06.8988952Z WARNING:
2025-05-07T20:23:06.8989432Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:06.8990427Z
2025-05-07T20:23:06.8990616Z   Available Versions:
2025-05-07T20:23:06.8990922Z
2025-05-07T20:23:06.8991118Z   Version 2023.7.20250331:
2025-05-07T20:23:06.8991520Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:06.8991793Z
2025-05-07T20:23:06.8991914Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:06.8992128Z
2025-05-07T20:23:06.8992214Z     Release notes:
2025-05-07T20:23:06.8992626Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:06.8992997Z
2025-05-07T20:23:06.8993097Z   Version 2023.7.20250414:
2025-05-07T20:23:06.8993398Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:06.8993700Z
2025-05-07T20:23:06.8993850Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:06.8994092Z
2025-05-07T20:23:06.8994252Z     Release notes:
2025-05-07T20:23:06.8994912Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:06.8995339Z
2025-05-07T20:23:06.8995467Z   Version 2023.7.20250428:
2025-05-07T20:23:06.8995856Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:06.8996190Z
2025-05-07T20:23:06.8996377Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:06.8996651Z
2025-05-07T20:23:06.8996767Z     Release notes:
2025-05-07T20:23:06.8997234Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:06.9008849Z
2025-05-07T20:23:06.9008974Z ================================================================================
2025-05-07T20:23:06.9353767Z
2025-05-07T20:23:06.9354133Z
2025-05-07T20:23:06.9354288Z Downgraded:
2025-05-07T20:23:06.9354781Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:06.9355542Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:06.9356008Z
2025-05-07T20:23:06.9356112Z Complete!
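Requesting nvidia-container-toolkit-1.16.2 while 1.17.6 is installed makes dnf plan the downgrade transaction shown above. A sketch that makes the pin explicit and only touches the package on a version mismatch (WANT mirrors this job's pin):

  # Sketch: pin nvidia-container-toolkit to a specific version, downgrading if needed.
  WANT=1.16.2
  HAVE=$(rpm -q --qf '%{VERSION}' nvidia-container-toolkit 2>/dev/null || echo none)
  if [ "$HAVE" != "$WANT" ]; then
    sudo yum install -y "nvidia-container-toolkit-${WANT}"
    sudo systemctl restart docker
  fi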
2025-05-07T20:23:06.9835544Z + sudo systemctl restart docker
2025-05-07T20:23:10.8644770Z Wed May 7 20:23:10 2025
2025-05-07T20:23:10.8645339Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.8645879Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8      |
2025-05-07T20:23:10.8646374Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.8646867Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:10.8647377Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:10.8647805Z |                                         |                        |               MIG M. |
2025-05-07T20:23:10.8648140Z |=========================================+========================+======================|
2025-05-07T20:23:10.8728819Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:10.8730595Z |  0%   31C    P0             63W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:10.8731362Z |                                         |                        |                  N/A |
2025-05-07T20:23:10.8732122Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.8732924Z
2025-05-07T20:23:10.8733462Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.8734003Z | Processes:                                                                              |
2025-05-07T20:23:10.8734440Z |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
2025-05-07T20:23:10.8735030Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:10.8735376Z |=========================================================================================|
2025-05-07T20:23:10.8735805Z |  No running processes found                                                             |
2025-05-07T20:23:10.8736274Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.9383031Z Command completed after 1 attempt(s).
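With the driver verified and persistence mode now showing "On", the setup script's last effect was exporting GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all into GITHUB_ENV (visible in the env block below). A sketch of how a later container-based step might consume it; the image tag is illustrative only:

  # Sketch: use the exported GPU_FLAG when launching a CUDA container.
  # GPU_FLAG is deliberately left unquoted so it splits into separate arguments.
  docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi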
2025-05-07T20:23:11.9471814Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.9472279Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.9486912Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:11.9487262Z env:
2025-05-07T20:23:11.9487488Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:11.9487791Z BUILD_ENV: build_binary
2025-05-07T20:23:11.9488035Z BUILD_TARGET: genai
2025-05-07T20:23:11.9488273Z BUILD_VARIANT: cuda
2025-05-07T20:23:11.9488503Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:11.9488756Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:11.9489057Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:11.9489381Z ##[endgroup]
2025-05-07T20:23:12.2837955Z ################################################################################
2025-05-07T20:23:12.2838315Z # Print System Info
2025-05-07T20:23:12.2838541Z #
2025-05-07T20:23:12.2853119Z # [2025-05-07T20:23:12.284Z] + print_system_info
2025-05-07T20:23:12.2853480Z ################################################################################
2025-05-07T20:23:12.2853692Z
2025-05-07T20:23:12.2853806Z ################################################################################
2025-05-07T20:23:12.2854137Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.2854436Z + printenv
2025-05-07T20:23:12.2854550Z
2025-05-07T20:23:12.2877204Z SHELL=/bin/bash
2025-05-07T20:23:12.2877699Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.2878129Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.2878658Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_4a7d9ed4-eb36-47db-9632-4aa240a026c5
2025-05-07T20:23:12.2879269Z GITHUB_ACTION=__run
2025-05-07T20:23:12.2879560Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.2879898Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.2880147Z RUNNER_NAME=i-0c2643f2bcfaf5e6b
2025-05-07T20:23:12.2880436Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.2880732Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.2881000Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.2881366Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.2881784Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.2882062Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.2882356Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.2882998Z ***
2025-05-07T20:23:12.2883195Z LOGNAME=ec2-user
2025-05-07T20:23:12.2883443Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.2883708Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.2883934Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.2884160Z SYSTEMD_EXEC_PID=55541
2025-05-07T20:23:12.2884444Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.2884980Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.2885487Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.2885773Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.2886025Z RUNNER_OS=Linux
2025-05-07T20:23:12.2886248Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.2886491Z HOME=/home/ec2-user
2025-05-07T20:23:12.2886735Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.2887023Z LANG=C.UTF-8
2025-05-07T20:23:12.2887318Z RUNNER_TRACKING_ID=github_aa71c52e-5c10-4a56-a421-f206faa9b39e
2025-05-07T20:23:12.2887671Z RUNNER_ARCH=X64
2025-05-07T20:23:12.2887937Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.2888537Z BUILD_TARGET=genai
2025-05-07T20:23:12.2889068Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_4a7d9ed4-eb36-47db-9632-4aa240a026c5
2025-05-07T20:23:12.2890298Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_4a7d9ed4-eb36-47db-9632-4aa240a026c5
2025-05-07T20:23:12.2891036Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.2891730Z INVOCATION_ID=dad162b31b1f499cb44ecb48a70cac1d
2025-05-07T20:23:12.2892064Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.2892318Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.2892892Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_4a7d9ed4-eb36-47db-9632-4aa240a026c5
2025-05-07T20:23:12.2893504Z BUILD_ENV=build_binary
2025-05-07T20:23:12.2893726Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.2893943Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.2894169Z KERN_NAME_LC=linux
2025-05-07T20:23:12.2894393Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:12.2894690Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.2895030Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.2895267Z USER=ec2-user
2025-05-07T20:23:12.2895501Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.2895780Z SHLVL=1
2025-05-07T20:23:12.2895971Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:12.2896284Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:12.2896740Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:12.2897090Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:12.2897326Z KERN_NAME=Linux
2025-05-07T20:23:12.2897551Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:12.2897944Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:12.2898366Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:12.2898640Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:12.2898875Z JOURNAL_STREAM=8:81829
2025-05-07T20:23:12.2899192Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:12.2899556Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:12.2899985Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:12.2900320Z GITHUB_BASE_REF=main
2025-05-07T20:23:12.2900539Z CI=true
2025-05-07T20:23:12.2900752Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:12.2901024Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:12.2901296Z GITHUB_ACTION_REF=
2025-05-07T20:23:12.2901541Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:12.2902135Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_4a7d9ed4-eb36-47db-9632-4aa240a026c5
2025-05-07T20:23:12.2902711Z MACHINE_NAME=x86_64
2025-05-07T20:23:12.2902928Z _=/usr/bin/printenv
2025-05-07T20:23:12.2903059Z 
2025-05-07T20:23:12.2903192Z ################################################################################
2025-05-07T20:23:12.2903514Z [INFO] Print ldd version ...
2025-05-07T20:23:12.2903773Z + ldd --version
2025-05-07T20:23:12.2903900Z 
2025-05-07T20:23:12.2903992Z ldd (GNU libc) 2.34
2025-05-07T20:23:12.2904255Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:12.2904692Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:23:12.2905217Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:12.2905655Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:12.2905877Z 
2025-05-07T20:23:12.2905990Z ################################################################################
2025-05-07T20:23:12.2906298Z [INFO] Print CPU info ...
2025-05-07T20:23:12.2906534Z + nproc
2025-05-07T20:23:12.2906641Z 
2025-05-07T20:23:12.2920148Z 16
2025-05-07T20:23:12.2921884Z 
2025-05-07T20:23:12.2922540Z + lscpu
2025-05-07T20:23:12.2922701Z 
2025-05-07T20:23:12.3035071Z Architecture:                         x86_64
2025-05-07T20:23:12.3035482Z CPU op-mode(s):                       32-bit, 64-bit
2025-05-07T20:23:12.3036296Z Address sizes:                        48 bits physical, 48 bits virtual
2025-05-07T20:23:12.3036697Z Byte Order:                           Little Endian
2025-05-07T20:23:12.3037008Z CPU(s):                               16
2025-05-07T20:23:12.3037311Z On-line CPU(s) list:                  0-15
2025-05-07T20:23:12.3037639Z Vendor ID:                            AuthenticAMD
2025-05-07T20:23:12.3037988Z Model name:                           AMD EPYC 7R32
2025-05-07T20:23:12.3038301Z CPU family:                           23
2025-05-07T20:23:12.3038925Z Model:                                49
2025-05-07T20:23:12.3039255Z Thread(s) per core:                   2
2025-05-07T20:23:12.3039560Z Core(s) per socket:                   8
2025-05-07T20:23:12.3039851Z Socket(s):                            1
2025-05-07T20:23:12.3040140Z Stepping:                             0
2025-05-07T20:23:12.3040449Z BogoMIPS:                             5599.62
2025-05-07T20:23:12.3042507Z Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:12.3044614Z Hypervisor vendor:                    KVM
2025-05-07T20:23:12.3044917Z Virtualization type:                  full
2025-05-07T20:23:12.3045257Z L1d cache:                            256 KiB (8 instances)
2025-05-07T20:23:12.3045620Z L1i cache:                            256 KiB (8 instances)
2025-05-07T20:23:12.3045973Z L2 cache:                             4 MiB (8 instances)
2025-05-07T20:23:12.3046324Z L3 cache:                             32 MiB (2 instances)
2025-05-07T20:23:12.3046653Z NUMA node(s):                         1
2025-05-07T20:23:12.3046939Z NUMA node0 CPU(s):                    0-15
2025-05-07T20:23:12.3047272Z Vulnerability Gather data sampling:   Not affected
2025-05-07T20:23:12.3047643Z Vulnerability Itlb multihit:          Not affected
2025-05-07T20:23:12.3048002Z Vulnerability L1tf:                   Not affected
2025-05-07T20:23:12.3048347Z Vulnerability Mds:                    Not affected
2025-05-07T20:23:12.3048705Z Vulnerability Meltdown:               Not affected
2025-05-07T20:23:12.3049062Z Vulnerability Mmio stale data:        Not affected
2025-05-07T20:23:12.3049418Z Vulnerability Reg file data sampling: Not affected
2025-05-07T20:23:12.3049967Z Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
2025-05-07T20:23:12.3050563Z Vulnerability Spec rstack overflow:   Mitigation; safe RET
2025-05-07T20:23:12.3051116Z Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
2025-05-07T20:23:12.3051829Z Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
2025-05-07T20:23:12.3052725Z Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
2025-05-07T20:23:12.3053414Z Vulnerability Srbds:                  Not affected
2025-05-07T20:23:12.3053781Z Vulnerability Tsx async abort:        Not affected
2025-05-07T20:23:12.3054011Z 
2025-05-07T20:23:12.3054112Z + cat /proc/cpuinfo
2025-05-07T20:23:12.3054249Z 
2025-05-07T20:23:12.3054421Z processor : 0
2025-05-07T20:23:12.3054636Z vendor_id : AuthenticAMD
2025-05-07T20:23:12.3054891Z cpu family : 23
2025-05-07T20:23:12.3055107Z model : 49
2025-05-07T20:23:12.3055318Z model name : AMD EPYC 7R32
2025-05-07T20:23:12.3055573Z stepping : 0
2025-05-07T20:23:12.3055790Z microcode : 0x830107f
2025-05-07T20:23:12.3056104Z cpu MHz : 2862.187
2025-05-07T20:23:12.3056321Z cache size : 512 KB
2025-05-07T20:23:12.3056535Z physical id : 0
2025-05-07T20:23:12.3056740Z siblings : 16
2025-05-07T20:23:12.3056942Z core id : 0
2025-05-07T20:23:12.3057146Z cpu cores : 8
2025-05-07T20:23:12.3057342Z apicid : 0
2025-05-07T20:23:12.3057545Z initial apicid : 0
2025-05-07T20:23:12.3057756Z fpu : yes
2025-05-07T20:23:12.3057949Z fpu_exception : yes
2025-05-07T20:23:12.3058165Z cpuid level : 13
2025-05-07T20:23:12.3058378Z wp : yes
2025-05-07T20:23:12.3060486Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:12.3062712Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:12.3063192Z bogomips : 5599.62
2025-05-07T20:23:12.3063415Z TLB size : 3072 4K pages
2025-05-07T20:23:12.3063652Z clflush size : 64
2025-05-07T20:23:12.3063864Z cache_alignment : 64
2025-05-07T20:23:12.3064135Z address sizes : 48 bits physical, 48 bits virtual
2025-05-07T20:23:12.3064456Z power management:
2025-05-07T20:23:12.3064588Z 
[ /proc/cpuinfo records for processors 1-15 omitted: identical to processor 0 except cpu MHz, core id, apicid, and initial apicid ]
2025-05-07T20:23:12.3229881Z 
2025-05-07T20:23:12.3229885Z 
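The probes above share one pattern: an [INFO] banner, then the command echoed with a leading "+", a blank line, and its output. A minimal bash sketch of the wrapper this implies, under assumed names (print_exec is hypothetical and not confirmed by the log; print_system_info is the function named in the step banner):

    print_exec () {
      # Echo the command in the "+ cmd" style used throughout this log,
      # then run it, separating the output with blank lines.
      echo "+ $*"
      echo ""
      "$@"
      echo ""
    }

    print_system_info () {
      echo "[INFO] Printing environment variables ..."
      print_exec printenv
      echo "[INFO] Print ldd version ..."
      print_exec ldd --version
      echo "[INFO] Print CPU info ..."
      print_exec nproc
      print_exec lscpu
      print_exec cat /proc/cpuinfo
    }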
2025-05-07T20:23:12.3230013Z ################################################################################
2025-05-07T20:23:12.3230322Z [INFO] Print PCI info ...
2025-05-07T20:23:12.3230578Z + lspci -v
2025-05-07T20:23:12.3230694Z 
2025-05-07T20:23:12.3230908Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:12.3231292Z 	Subsystem: Amazon.com, Inc. Device 1237
2025-05-07T20:23:12.3231610Z 	Flags: bus master, medium devsel, latency 0
2025-05-07T20:23:12.3231825Z 
2025-05-07T20:23:12.3232026Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:12.3232410Z 	Physical Slot: 1
2025-05-07T20:23:12.3232666Z 	Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:12.3232892Z 
2025-05-07T20:23:12.3233164Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:12.3233597Z 	Physical Slot: 1
2025-05-07T20:23:12.3233858Z 	Flags: bus master, fast devsel, latency 0, IRQ 9
2025-05-07T20:23:12.3234082Z 
2025-05-07T20:23:12.3234355Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller])
2025-05-07T20:23:12.3234798Z 	Physical Slot: 3
2025-05-07T20:23:12.3235047Z 	Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:12.3235392Z 	Memory at c1000000 (32-bit, prefetchable) [size=4M]
2025-05-07T20:23:12.3235745Z 	Expansion ROM at 000c0000 [disabled] [size=128K]
2025-05-07T20:23:12.3235974Z 
2025-05-07T20:23:12.3236276Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express])
2025-05-07T20:23:12.3236874Z 	Subsystem: Amazon.com, Inc. Device 0000
2025-05-07T20:23:12.3237162Z 	Physical Slot: 4
2025-05-07T20:23:12.3237429Z 	Flags: bus master, fast devsel, latency 0, IRQ 11
2025-05-07T20:23:12.3237813Z 	Memory at c1808000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:12.3238170Z 	Capabilities:
2025-05-07T20:23:12.3238439Z 	Kernel driver in use: nvme
2025-05-07T20:23:12.3238610Z 
2025-05-07T20:23:12.3238911Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:12.3239396Z 	Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:12.3239739Z 	Physical Slot: 5
2025-05-07T20:23:12.3239992Z 	Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:12.3240353Z 	Memory at c1804000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:12.3240732Z 	Memory at c1400000 (32-bit, prefetchable) [size=4M]
2025-05-07T20:23:12.3241066Z 	Capabilities:
2025-05-07T20:23:12.3241350Z 	Kernel driver in use: ena
2025-05-07T20:23:12.3241603Z 	Kernel modules: ena
2025-05-07T20:23:12.3241745Z 
2025-05-07T20:23:12.3241918Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:12.3242306Z 	Subsystem: NVIDIA Corporation Device 152f
2025-05-07T20:23:12.3242601Z 	Physical Slot: 30
2025-05-07T20:23:12.3242879Z 	Flags: bus master, fast devsel, latency 0, IRQ 10
2025-05-07T20:23:12.3243295Z 	Memory at c0000000 (32-bit, non-prefetchable) [size=16M]
2025-05-07T20:23:12.3243696Z 	Memory at 1800000000 (64-bit, prefetchable) [size=32G]
2025-05-07T20:23:12.3244069Z 	Memory at 1040000000 (64-bit, prefetchable) [size=32M]
2025-05-07T20:23:12.3244410Z 	Capabilities:
2025-05-07T20:23:12.3244689Z 	Kernel driver in use: nvidia
2025-05-07T20:23:12.3244953Z 	Kernel modules: nvidia
2025-05-07T20:23:12.3245097Z 
2025-05-07T20:23:12.3245400Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express])
2025-05-07T20:23:12.3245919Z 	Subsystem: Amazon.com, Inc. Device 0000
2025-05-07T20:23:12.3246213Z 	Physical Slot: 31
2025-05-07T20:23:12.3246459Z 	Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:12.3246820Z 	Memory at c1800000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:12.3247207Z 	Memory at c180c000 (32-bit, prefetchable) [size=8K]
2025-05-07T20:23:12.3247538Z 	Capabilities:
2025-05-07T20:23:12.3247813Z 	Kernel driver in use: nvme
2025-05-07T20:23:12.3247982Z 
2025-05-07T20:23:12.3247986Z 
2025-05-07T20:23:12.3248107Z ################################################################################
2025-05-07T20:23:12.3248437Z [INFO] Print Linux distribution info ...
2025-05-07T20:23:12.3248722Z + uname -a
2025-05-07T20:23:12.3248847Z 
2025-05-07T20:23:12.3249252Z Linux ip-10-0-1-116.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
2025-05-07T20:23:12.3249737Z 
2025-05-07T20:23:12.3249824Z + uname -m
2025-05-07T20:23:12.3249945Z 
2025-05-07T20:23:12.3250023Z x86_64
2025-05-07T20:23:12.3250129Z 
2025-05-07T20:23:12.3250215Z + cat /proc/version
2025-05-07T20:23:12.3250353Z 
2025-05-07T20:23:12.3250886Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025
2025-05-07T20:23:12.3251495Z 
2025-05-07T20:23:12.3251586Z + cat /etc/os-release
2025-05-07T20:23:12.3251727Z 
2025-05-07T20:23:12.3251839Z NAME="Amazon Linux"
2025-05-07T20:23:12.3252051Z VERSION="2023"
2025-05-07T20:23:12.3252257Z ID="amzn"
2025-05-07T20:23:12.3252445Z ID_LIKE="fedora"
2025-05-07T20:23:12.3252654Z VERSION_ID="2023"
2025-05-07T20:23:12.3252891Z PLATFORM_ID="platform:al2023"
2025-05-07T20:23:12.3253179Z PRETTY_NAME="Amazon Linux 2023.6.20250317"
2025-05-07T20:23:12.3260237Z ANSI_COLOR="0;33"
2025-05-07T20:23:12.3260541Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
2025-05-07T20:23:12.3261067Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
2025-05-07T20:23:12.3261515Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
2025-05-07T20:23:12.3261944Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
2025-05-07T20:23:12.3262392Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
2025-05-07T20:23:12.3262767Z VENDOR_NAME="AWS"
2025-05-07T20:23:12.3263015Z VENDOR_URL="https://aws.amazon.com/"
2025-05-07T20:23:12.3263303Z SUPPORT_END="2029-06-30"
2025-05-07T20:23:12.3263463Z 
2025-05-07T20:23:12.3263703Z ################################################################################
2025-05-07T20:23:12.3264020Z # Print EC2 Instance Info
2025-05-07T20:23:12.3264264Z #
2025-05-07T20:23:12.3264486Z # [2025-05-07T20:23:12.325Z] + print_ec2_info
2025-05-07T20:23:12.3264807Z ################################################################################
2025-05-07T20:23:12.3265021Z 
2025-05-07T20:23:12.3386721Z ami-id: ami-071226ecf16aa7d96
2025-05-07T20:23:12.3496321Z instance-id: i-0c2643f2bcfaf5e6b
2025-05-07T20:23:12.3603618Z instance-type: g5.4xlarge
2025-05-07T20:23:12.3646486Z ##[group]Run . $PRELUDE; print_gpu_info
2025-05-07T20:23:12.3646848Z . $PRELUDE; print_gpu_info
2025-05-07T20:23:12.3657356Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:12.3657702Z env:
2025-05-07T20:23:12.3657923Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:12.3658228Z   BUILD_ENV: build_binary
2025-05-07T20:23:12.3658480Z   BUILD_TARGET: genai
2025-05-07T20:23:12.3658709Z   BUILD_VARIANT: cuda
2025-05-07T20:23:12.3658953Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:12.3659219Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:12.3659523Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.3659962Z ##[endgroup]
2025-05-07T20:23:12.7035521Z ################################################################################
2025-05-07T20:23:12.7035939Z [INFO] Printing general display info ...
2025-05-07T20:23:12.7065696Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:12.8230980Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:12.8239706Z /usr/bin/sudo
2025-05-07T20:23:12.8250740Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:12.8261684Z /usr/bin/yum
2025-05-07T20:23:12.8263312Z [INSTALL] Updating system repositories ...
2025-05-07T20:23:12.8283577Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y
2025-05-07T20:23:13.2661468Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025.
2025-05-07T20:23:13.3404754Z ================================================================================
2025-05-07T20:23:13.3405083Z WARNING:
2025-05-07T20:23:13.3405340Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:13.3405572Z 
2025-05-07T20:23:13.3405671Z   Available Versions:
2025-05-07T20:23:13.3405817Z 
2025-05-07T20:23:13.3405930Z   Version 2023.7.20250331:
2025-05-07T20:23:13.3406233Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:13.3406512Z 
2025-05-07T20:23:13.3406644Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:13.3406847Z 
2025-05-07T20:23:13.3406935Z     Release notes:
2025-05-07T20:23:13.3407332Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:13.3407702Z 
2025-05-07T20:23:13.3407792Z   Version 2023.7.20250414:
2025-05-07T20:23:13.3408099Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:13.3408341Z 
2025-05-07T20:23:13.3408461Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:13.3408668Z 
2025-05-07T20:23:13.3408752Z     Release notes:
2025-05-07T20:23:13.3409141Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:13.3409501Z 
2025-05-07T20:23:13.3409589Z   Version 2023.7.20250428:
2025-05-07T20:23:13.3409891Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:13.3410137Z 
2025-05-07T20:23:13.3410465Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:13.3410679Z 
2025-05-07T20:23:13.3410764Z     Release notes:
2025-05-07T20:23:13.3411150Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:13.3411506Z 
2025-05-07T20:23:13.3411626Z ================================================================================
2025-05-07T20:23:13.4578177Z Dependencies resolved.
2025-05-07T20:23:13.4867899Z ================================================================================
2025-05-07T20:23:13.4868320Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:13.4868706Z ================================================================================
2025-05-07T20:23:13.4869001Z Upgrading:
2025-05-07T20:23:13.4869358Z  nvidia-container-toolkit       x86_64  1.17.6-1  nvidia-container-toolkit  1.2 M
2025-05-07T20:23:13.4869937Z  nvidia-container-toolkit-base  x86_64  1.17.6-1  nvidia-container-toolkit  5.7 M
2025-05-07T20:23:13.4870291Z 
2025-05-07T20:23:13.4870610Z Transaction Summary
2025-05-07T20:23:13.4870868Z ================================================================================
2025-05-07T20:23:13.4871173Z Upgrade  2 Packages
2025-05-07T20:23:13.4871309Z 
2025-05-07T20:23:13.4871753Z Total download size: 6.9 M
2025-05-07T20:23:13.4872652Z Downloading Packages:
2025-05-07T20:23:13.5288521Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64  31 MB/s | 1.2 MB  00:00
2025-05-07T20:23:13.5753326Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x  65 MB/s | 5.7 MB  00:00
2025-05-07T20:23:13.5761265Z --------------------------------------------------------------------------------
2025-05-07T20:23:13.5764445Z Total                                            78 MB/s | 6.9 MB  00:00
2025-05-07T20:23:13.5767008Z Running transaction check
2025-05-07T20:23:13.5866190Z Transaction check succeeded.
2025-05-07T20:23:13.5866654Z Running transaction test
2025-05-07T20:23:13.6159403Z Transaction test succeeded.
2025-05-07T20:23:13.6163016Z Running transaction
2025-05-07T20:23:14.1663278Z   Preparing        :                                                      1/1
2025-05-07T20:23:14.2728496Z   Upgrading        : nvidia-container-toolkit-base-1.17.6-1.x86_64        1/4
2025-05-07T20:23:14.2757378Z   Upgrading        : nvidia-container-toolkit-1.17.6-1.x86_64             2/4
2025-05-07T20:23:14.2948957Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64             2/4
2025-05-07T20:23:14.2949731Z   Cleanup          : nvidia-container-toolkit-1.16.2-1.x86_64             3/4
2025-05-07T20:23:14.3062569Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64             3/4
2025-05-07T20:23:14.3092244Z   Cleanup          : nvidia-container-toolkit-base-1.16.2-1.x86_64        4/4
2025-05-07T20:23:14.4525140Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64             4/4
2025-05-07T20:23:14.4526264Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64             1/4
2025-05-07T20:23:14.4527345Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64             2/4
2025-05-07T20:23:14.4528373Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64        3/4
2025-05-07T20:23:14.6639701Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64        4/4
2025-05-07T20:23:14.6640046Z 
2025-05-07T20:23:14.6640129Z Upgraded:
2025-05-07T20:23:14.6640474Z   nvidia-container-toolkit-1.17.6-1.x86_64
2025-05-07T20:23:14.6641028Z   nvidia-container-toolkit-base-1.17.6-1.x86_64
2025-05-07T20:23:14.6641359Z 
2025-05-07T20:23:14.6641437Z Complete!
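Both yum invocations in this job carry the "[EXEC] [ATTEMPT 0/3]" prefix, which indicates setup_env.bash wraps package operations in a retry helper after probing for apt-get and falling back to /usr/bin/yum. A rough sketch of that pattern, with the helper name assumed rather than taken from the script:

    exec_with_retries () {
      # Hypothetical retry helper: try a command up to four times (attempts 0-3),
      # logging each attempt in the "[EXEC] [ATTEMPT n/3]" style seen above.
      local max_retries=3
      local attempt
      for attempt in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        if "$@"; then
          return 0
        fi
        sleep 2
      done
      return 1
    }

    # The update and install steps above then reduce to:
    exec_with_retries sudo yum update -y
    exec_with_retries sudo yum install -y hostname lshw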
2025-05-07T20:23:14.7105851Z [INSTALL] Installing system package(s): hostname lshw ...
2025-05-07T20:23:14.7130582Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw
2025-05-07T20:23:15.2300629Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:04 2025.
2025-05-07T20:23:15.2539208Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed.
2025-05-07T20:23:15.2943942Z Dependencies resolved.
2025-05-07T20:23:15.3120409Z ================================================================================
2025-05-07T20:23:15.3120878Z  Package   Architecture   Version                   Repository    Size
2025-05-07T20:23:15.3121292Z ================================================================================
2025-05-07T20:23:15.3121594Z Installing:
2025-05-07T20:23:15.3121889Z  lshw      x86_64         B.02.19.2-7.amzn2023.0.3  amazonlinux   319 k
2025-05-07T20:23:15.3122152Z 
2025-05-07T20:23:15.3122253Z Transaction Summary
2025-05-07T20:23:15.3122498Z ================================================================================
2025-05-07T20:23:15.3122800Z Install  1 Package
2025-05-07T20:23:15.3122935Z 
2025-05-07T20:23:15.3123057Z Total download size: 319 k
2025-05-07T20:23:15.3123848Z Installed size: 837 k
2025-05-07T20:23:15.3124848Z Downloading Packages:
2025-05-07T20:23:15.3901497Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm         6.6 MB/s | 319 kB  00:00
2025-05-07T20:23:15.3906988Z --------------------------------------------------------------------------------
2025-05-07T20:23:15.3909745Z Total                                            4.0 MB/s | 319 kB  00:00
2025-05-07T20:23:15.4067454Z Running transaction check
2025-05-07T20:23:15.4122849Z Transaction check succeeded.
2025-05-07T20:23:15.4123408Z Running transaction test
2025-05-07T20:23:15.4582760Z Transaction test succeeded.
2025-05-07T20:23:15.4586543Z Running transaction
2025-05-07T20:23:15.5641565Z   Preparing        :                                                      1/1
2025-05-07T20:23:15.6179701Z   Installing       : lshw-B.02.19.2-7.amzn2023.0.3.x86_64                 1/1
2025-05-07T20:23:15.8295635Z   Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64                 1/1
2025-05-07T20:23:15.9861344Z   Verifying        : lshw-B.02.19.2-7.amzn2023.0.3.x86_64                 1/1
2025-05-07T20:23:15.9861664Z 
2025-05-07T20:23:15.9861757Z Installed:
2025-05-07T20:23:15.9862063Z   lshw-B.02.19.2-7.amzn2023.0.3.x86_64
2025-05-07T20:23:15.9862354Z 
2025-05-07T20:23:15.9862436Z Complete!
2025-05-07T20:23:16.0332268Z + hostname
2025-05-07T20:23:16.0332461Z 
2025-05-07T20:23:16.0346965Z ip-10-0-1-116.ec2.internal
2025-05-07T20:23:16.0348523Z 
2025-05-07T20:23:16.0349166Z + sudo lshw -C display
2025-05-07T20:23:16.0349375Z 
2025-05-07T20:23:16.4780672Z   *-display:0 UNCLAIMED
2025-05-07T20:23:16.4781080Z        description: VGA compatible controller
2025-05-07T20:23:16.4781413Z        product: Amazon.com, Inc.
2025-05-07T20:23:16.4781688Z        vendor: Amazon.com, Inc.
2025-05-07T20:23:16.4781938Z        physical id: 3
2025-05-07T20:23:16.4782175Z        bus info: pci@0000:00:03.0
2025-05-07T20:23:16.4782431Z        version: 00
2025-05-07T20:23:16.4782635Z        width: 32 bits
2025-05-07T20:23:16.4782854Z        clock: 33MHz
2025-05-07T20:23:16.4783098Z        capabilities: vga_controller bus_master
2025-05-07T20:23:16.4783415Z        configuration: latency=0
2025-05-07T20:23:16.4783730Z        resources: memory:c1000000-c13fffff memory:c0000-dffff
2025-05-07T20:23:16.4784055Z   *-display:1
2025-05-07T20:23:16.4784283Z        description: 3D controller
2025-05-07T20:23:16.4784590Z        product: GA102GL [A10G]
2025-05-07T20:23:16.4784855Z        vendor: NVIDIA Corporation
2025-05-07T20:23:16.4785119Z        physical id: 1e
2025-05-07T20:23:16.4785350Z        bus info: pci@0000:00:1e.0
2025-05-07T20:23:16.4785600Z        version: a1
2025-05-07T20:23:16.4785808Z        width: 64 bits
2025-05-07T20:23:16.4786018Z        clock: 33MHz
2025-05-07T20:23:16.4786310Z        capabilities: pm pciexpress msix bus_master cap_list
2025-05-07T20:23:16.4786676Z        configuration: driver=nvidia latency=0
2025-05-07T20:23:16.4787282Z        resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff
2025-05-07T20:23:16.4819221Z 
2025-05-07T20:23:16.4819555Z ################################################################################
2025-05-07T20:23:16.4819989Z [INFO] Printing NVIDIA GPU info ...
2025-05-07T20:23:16.4948563Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:16.5115045Z Wed May 7 20:23:16 2025
2025-05-07T20:23:16.5115437Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:16.5115927Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:16.5117818Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:16.5118290Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:16.5118807Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:16.5119229Z |                                         |                        |               MIG M. |
2025-05-07T20:23:16.5119560Z |=========================================+========================+======================|
2025-05-07T20:23:16.5195193Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:16.5195862Z |  0%   31C    P0             60W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:16.5196239Z |                                         |                        |                  N/A |
2025-05-07T20:23:16.5196620Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:16.5197015Z 
2025-05-07T20:23:16.5197400Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:16.5197818Z | Processes:                                                                              |
2025-05-07T20:23:16.5198247Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:16.5198653Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:16.5199001Z |=========================================================================================|
2025-05-07T20:23:16.5199997Z |  No running processes found                                                             |
2025-05-07T20:23:16.5200456Z +-----------------------------------------------------------------------------------------+
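ENFORCE_CUDA_DEVICE=1 is set in this job's environment, and the nvidia-smi listing above is the signal it gates on. A plausible sketch of the enforcement, assuming it only requires that nvidia-smi can enumerate at least one device (the actual check lives in .github/scripts/setup_env.bash and may differ):

    if [ "${ENFORCE_CUDA_DEVICE:-0}" = "1" ]; then
      # `nvidia-smi -L` lists visible GPUs and exits non-zero when none are found.
      if ! nvidia-smi -L > /dev/null 2>&1; then
        echo "[CHECK] ENFORCE_CUDA_DEVICE=1 but no CUDA device is visible; failing the job."
        exit 1
      fi
    fi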
2025-05-07T20:23:16.6750184Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.6751226Z [CHECK] rocminfo not found 2025-05-07T20:23:16.6759962Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.6761438Z [CHECK] rocm-smi not found 2025-05-07T20:23:16.6820428Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.6820867Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.6832584Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:16.6832938Z env: 2025-05-07T20:23:16.6833163Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:16.6833464Z BUILD_ENV: build_binary 2025-05-07T20:23:16.6833701Z BUILD_TARGET: genai 2025-05-07T20:23:16.6833912Z BUILD_VARIANT: cuda 2025-05-07T20:23:16.6834133Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:16.6834380Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:16.6834668Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:16.6834996Z ##[endgroup] 2025-05-07T20:23:17.0189202Z ################################################################################ 2025-05-07T20:23:17.0189576Z # Setup Miniconda 2025-05-07T20:23:17.0189794Z # 2025-05-07T20:23:17.0206913Z # [2025-05-07T20:23:17.020Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:17.0207462Z ################################################################################ 2025-05-07T20:23:17.0207679Z 2025-05-07T20:23:17.0222566Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:17.1129824Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:17.1130320Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:17.1130516Z 2025-05-07T20:23:17.1147385Z 2025-05-07T20:23:17.1147595Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:17.1170180Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:18.4668936Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:18.4669333Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:18.4669597Z 2025-05-07T20:23:18.4815153Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:18.9278176Z Unpacking payload ... 2025-05-07T20:23:19.4455360Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:20.2490407Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:22.3487497Z 2025-05-07T20:23:22.3487909Z Installing base environment... 2025-05-07T20:23:22.3488162Z 2025-05-07T20:23:23.4274268Z Preparing transaction: ...working... done 2025-05-07T20:23:26.4329981Z Executing transaction: ...working... done 2025-05-07T20:23:27.0905534Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:27.1785434Z installation finished. 2025-05-07T20:23:27.1793682Z 2025-05-07T20:23:27.1794583Z + rm -f miniconda.sh 2025-05-07T20:23:27.1794850Z 2025-05-07T20:23:27.2099515Z 2025-05-07T20:23:27.2099943Z [SETUP] Reloading the bash configuration ... 
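(The reload announced above is needed because conda init only edits ~/.bashrc; the script then re-sources it so that conda commands work in the same non-interactive shell. An equivalent that avoids editing ~/.bashrc at all is to source conda's hook script directly; a sketch, using the profile.d path that conda init lists below:)

    # Sketch: enable `conda activate` without conda init / .bashrc edits
    source /home/ec2-user/miniconda/etc/profile.d/conda.sh
    conda activate base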
2025-05-07T20:23:27.2100309Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.5741470Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.5742042Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.5742576Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.5743100Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.5743636Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.5744219Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.5744859Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.5745513Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.5746196Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.5747344Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.5747916Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.5748284Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:27.5748676Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.6397515Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:28.4737930Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.4761518Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:42.0884696Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.6691929Z Solving environment: done
2025-05-07T20:23:43.7659989Z ## Package Plan ##
2025-05-07T20:23:43.7660305Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.7660675Z added / updated specs:
2025-05-07T20:23:43.7660944Z - conda-libmamba-solver
2025-05-07T20:23:43.7661215Z - libarchive
2025-05-07T20:23:43.7661425Z - libmamba
2025-05-07T20:23:43.7661635Z - libmambapy
2025-05-07T20:23:43.7661915Z The following packages will be downloaded:
2025-05-07T20:23:43.7662248Z package | build
2025-05-07T20:23:43.7662572Z ---------------------------|-----------------
2025-05-07T20:23:43.7662991Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:23:43.7663471Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:23:43.7663895Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:23:43.7664370Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:23:43.7664818Z ------------------------------------------------------------
2025-05-07T20:23:43.7665160Z Total: 1.4 MB
2025-05-07T20:23:43.7665493Z The following packages will be UPDATED:
2025-05-07T20:23:43.7670637Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.7671413Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.7672001Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.7672634Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.7673420Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.7674052Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:43.8338197Z conda-libmamba-solve | 41 KB | ########## | 100%
2025-05-07T20:23:43.8752686Z ca-certificates-2025 | 149 KB | ########## | 100%
2025-05-07T20:23:43.9676601Z certifi-2025.4.26 | 154 KB | ########## | 100%
2025-05-07T20:23:43.9682715Z conda-25.3.1 | 1.1 MB | ########## | 100%
2025-05-07T20:23:43.9684369Z done
2025-05-07T20:23:44.0691471Z Preparing transaction: done
2025-05-07T20:23:44.1694718Z Verifying transaction: done
2025-05-07T20:23:45.4712400Z Executing transaction: done
2025-05-07T20:23:47.2062312Z [SETUP] Updating Miniconda base packages ...
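(A note on the [EXEC] [ATTEMPT 0/3] prefix that recurs throughout this log: it comes from a retry wrapper in setup_env.bash that re-runs flaky network-bound commands. The wrapper itself is not shown in the log; a minimal sketch of the pattern, with the function name, backoff, and attempt count as assumptions, is below, with usage matching the command that follows.)

    # Sketch of a retry wrapper: run a command, retrying up to 3 times
    exec_with_retries() {
      local attempt
      for attempt in 0 1 2; do
        echo "[EXEC] [ATTEMPT ${attempt}/3] + $*"
        "$@" && return 0
        sleep 10  # brief pause before retrying
      done
      return 1
    }

    exec_with_retries conda update -n base -c defaults --update-deps -y conda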
2025-05-07T20:23:47.2087130Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:48.1184528Z Channels:
2025-05-07T20:23:48.1184789Z - defaults
2025-05-07T20:23:48.1184997Z Platform: linux-64
2025-05-07T20:23:49.3488445Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.4662821Z Solving environment: done
2025-05-07T20:23:49.4663217Z Channels:
2025-05-07T20:23:49.4663217Z - defaults
2025-05-07T20:23:49.4663453Z Platform: linux-64
2025-05-07T20:23:49.7609890Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.9765880Z Solving environment: done
2025-05-07T20:23:50.1255782Z ## Package Plan ##
2025-05-07T20:23:50.1256202Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:50.1256666Z added / updated specs:
2025-05-07T20:23:50.1256965Z - conda
2025-05-07T20:23:50.1257207Z The following packages will be downloaded:
2025-05-07T20:23:50.1257540Z package | build
2025-05-07T20:23:50.1257860Z ---------------------------|-----------------
2025-05-07T20:23:50.1258198Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:23:50.1258825Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:23:50.1259317Z ------------------------------------------------------------
2025-05-07T20:23:50.1259879Z Total: 1.4 MB
2025-05-07T20:23:50.1260333Z The following packages will be UPDATED:
2025-05-07T20:23:50.1260956Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:50.1261458Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:50.1261856Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:50.2260009Z tzdata-2025b | 116 KB | ########## | 100%
2025-05-07T20:23:50.3998720Z pip-25.1 | 1.3 MB | ########## | 100%
2025-05-07T20:23:50.3999470Z done
2025-05-07T20:23:50.5002512Z Preparing transaction: done
2025-05-07T20:23:50.6008991Z Verifying transaction: done
2025-05-07T20:23:52.7039582Z Executing transaction: done
2025-05-07T20:23:53.3154193Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.3155670Z + conda clean --packages --tarball -y
2025-05-07T20:23:54.3490738Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.3491161Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.4219995Z + conda clean --all -y
2025-05-07T20:23:54.9770792Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.9771166Z Will remove 1 index cache(s).
2025-05-07T20:23:54.9771450Z There are no unused package(s) to remove.
2025-05-07T20:23:54.9771755Z There are no tempfile(s) to remove. 2025-05-07T20:23:54.9772045Z There are no logfile(s) to remove. 2025-05-07T20:23:55.0405475Z 2025-05-07T20:23:55.0410322Z + conda info 2025-05-07T20:23:55.0410497Z 2025-05-07T20:23:55.8195500Z 2025-05-07T20:23:55.8195969Z active environment : base 2025-05-07T20:23:55.8196324Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:55.8196700Z shell level : 1 2025-05-07T20:23:55.8196989Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:55.8197365Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:55.8197748Z conda version : 25.3.1 2025-05-07T20:23:55.8198029Z conda-build version : not installed 2025-05-07T20:23:55.8198327Z python version : 3.13.2.final.0 2025-05-07T20:23:55.8198620Z solver : libmamba (default) 2025-05-07T20:23:55.8198931Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:55.8199225Z __conda=25.3.1=0 2025-05-07T20:23:55.8199501Z __cuda=12.8=0 2025-05-07T20:23:55.8199774Z __glibc=2.34=0 2025-05-07T20:23:55.8200051Z __linux=6.1.130=0 2025-05-07T20:23:55.8200316Z __unix=0=0 2025-05-07T20:23:55.8200646Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:55.8201404Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:55.8201748Z conda av metadata url : None 2025-05-07T20:23:55.8202117Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:55.8202550Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:55.8202929Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:55.8203295Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:55.8203660Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:55.8203999Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:55.8204330Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:55.8204661Z /home/ec2-user/.conda/envs 2025-05-07T20:23:55.8204959Z platform : linux-64 2025-05-07T20:23:55.8205790Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:55.8206597Z UID:GID : 1000:1000 2025-05-07T20:23:55.8207013Z netrc file : None 2025-05-07T20:23:55.8207274Z offline mode : False 2025-05-07T20:23:55.8207438Z 2025-05-07T20:23:55.8872431Z 2025-05-07T20:23:55.8872706Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:55.8873416Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_4d378ef6-9297-48d4-9fb0-05cc395e54c6 ... 2025-05-07T20:23:55.8874203Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:55.8946298Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.10 2025-05-07T20:23:55.8946787Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.10 2025-05-07T20:23:55.8965592Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:55.8965943Z env: 2025-05-07T20:23:55.8966158Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:55.8966456Z BUILD_ENV: build_binary 2025-05-07T20:23:55.8966695Z BUILD_TARGET: genai 2025-05-07T20:23:55.8966930Z BUILD_VARIANT: cuda 2025-05-07T20:23:55.8967157Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:55.8967407Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:55.8967708Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:55.8968026Z ##[endgroup] 2025-05-07T20:23:56.2345751Z ################################################################################ 2025-05-07T20:23:56.2346198Z # Create Conda Environment 2025-05-07T20:23:56.2346441Z # 2025-05-07T20:23:56.2360702Z # [2025-05-07T20:23:56.235Z] + create_conda_environment build_binary 3.10 2025-05-07T20:23:56.2361117Z ################################################################################ 2025-05-07T20:23:56.2361331Z 2025-05-07T20:23:56.2375574Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:56.3291604Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:56.3292158Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:56.3292620Z + conda info --envs 2025-05-07T20:23:56.3292850Z 2025-05-07T20:23:57.0944041Z 2025-05-07T20:23:57.0944291Z # conda environments: 2025-05-07T20:23:57.0944537Z # 2025-05-07T20:23:57.0952324Z base /home/ec2-user/miniconda 2025-05-07T20:23:57.0952560Z 2025-05-07T20:23:57.1615753Z 2025-05-07T20:23:57.1616249Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:58.7957421Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:58.7957942Z 2025-05-07T20:23:58.7974293Z 2025-05-07T20:23:58.7983780Z [SETUP] Creating new Conda environment (Python 3.10) ... 
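(Note the rm -rf of the prefix directory just above: wiping the directory itself, rather than trusting conda env remove, guarantees a clean slate even if an earlier run left a half-created environment behind. Boiled down, the recreate-from-scratch step is just the following, a sketch consolidating the commands this log runs.)

    # Sketch: idempotent recreation of the build environment with a pinned interpreter
    rm -rf "$HOME/miniconda/envs/build_binary"
    conda create -y -n build_binary python=3.10
    # CI steps invoke tools via `conda run` instead of activating the env
    conda run -n build_binary python --version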
2025-05-07T20:23:58.8005332Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.10
2025-05-07T20:23:59.5501546Z Channels:
2025-05-07T20:23:59.5502060Z - defaults
2025-05-07T20:23:59.5502296Z Platform: linux-64
2025-05-07T20:24:01.0898967Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:01.1903964Z Solving environment: done
2025-05-07T20:24:01.2191593Z ## Package Plan ##
2025-05-07T20:24:01.2191972Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:01.2192384Z added / updated specs:
2025-05-07T20:24:01.2192638Z - python=3.10
2025-05-07T20:24:01.2192912Z The following packages will be downloaded:
2025-05-07T20:24:01.2193260Z package | build
2025-05-07T20:24:01.2193583Z ---------------------------|-----------------
2025-05-07T20:24:01.2193943Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:01.2194330Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:01.2194744Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:01.2195153Z python-3.10.16 | he870216_1 26.9 MB
2025-05-07T20:24:01.2195543Z setuptools-78.1.1 | py310h06a4308_0 1.7 MB
2025-05-07T20:24:01.2196255Z wheel-0.45.1 | py310h06a4308_0 115 KB
2025-05-07T20:24:01.2196618Z ------------------------------------------------------------
2025-05-07T20:24:01.2196950Z Total: 28.8 MB
2025-05-07T20:24:01.2197280Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:01.2197903Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:01.2198349Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:01.2198767Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:01.2199237Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:01.2199772Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:01.2200231Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:01.2200671Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.2201095Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:01.2201547Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.2201994Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:01.2202409Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:01.2202815Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:01.2203216Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:01.2203611Z python pkgs/main/linux-64::python-3.10.16-he870216_1
2025-05-07T20:24:01.2204033Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:01.2204493Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py310h06a4308_0
2025-05-07T20:24:01.2204961Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:01.2205345Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:01.2205716Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:01.2206130Z wheel pkgs/main/linux-64::wheel-0.45.1-py310h06a4308_0
2025-05-07T20:24:01.2206519Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:01.2206891Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:01.2207277Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:01.2766817Z wheel-0.45.1 | 115 KB | ########## | 100%
2025-05-07T20:24:01.2885181Z _libgcc_mutex-0.1 | 3 KB | ########## | 100%
2025-05-07T20:24:01.3033287Z ca-certificates-2025 | 129 KB | ########## | 100%
2025-05-07T20:24:01.4193680Z _openmp_mutex-5.1 | 21 KB | ########## | 100%
2025-05-07T20:24:01.7712765Z python-3.10.16 | 26.9 MB | ########## | 100%
2025-05-07T20:24:02.2540070Z setuptools-78.1.1 | 1.7 MB | ########## | 100%
2025-05-07T20:24:02.2548208Z done
2025-05-07T20:24:02.4654400Z Preparing transaction: done
2025-05-07T20:24:03.6274999Z Verifying transaction: done
2025-05-07T20:24:05.9491300Z Executing transaction: done
2025-05-07T20:24:05.9994258Z #
2025-05-07T20:24:05.9994511Z # To activate this environment, use
2025-05-07T20:24:05.9994792Z #
2025-05-07T20:24:05.9994999Z # $ conda activate build_binary
2025-05-07T20:24:05.9995577Z #
2025-05-07T20:24:05.9995798Z # To deactivate an active environment, use
2025-05-07T20:24:05.9996085Z #
2025-05-07T20:24:05.9996274Z # $ conda deactivate
2025-05-07T20:24:06.1090045Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:06.1112069Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:09.0546205Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (25.1)
2025-05-07T20:24:09.0547454Z Collecting pip
2025-05-07T20:24:09.0547791Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:09.0548209Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:09.0549046Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 55.6 MB/s eta 0:00:00
2025-05-07T20:24:09.0549407Z Installing collected packages: pip
2025-05-07T20:24:09.0549696Z Attempting uninstall: pip
2025-05-07T20:24:09.0549984Z Found existing installation: pip 25.1
2025-05-07T20:24:09.0550312Z Uninstalling pip-25.1:
2025-05-07T20:24:09.0550585Z Successfully uninstalled pip-25.1
2025-05-07T20:24:09.0550896Z Successfully installed pip-25.1.1
2025-05-07T20:24:09.1182449Z [SETUP] Upgrading pyOpenSSL ...
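(The pin pyOpenSSL>22.1.0 matters because older pyOpenSSL releases are incompatible with the cryptography versions conda-forge now ships; the solve below pulls in cryptography 44.0.3 alongside it. One reproduction detail, sketched here: the > in the spec must be quoted so the shell does not treat it as a redirect.)

    # Sketch: quote the version spec so '>' reaches conda, not the shell
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"
    # Sanity-check the import, as the script does after installing
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"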
2025-05-07T20:24:09.1204874Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:09.9783999Z Channels:
2025-05-07T20:24:09.9784330Z - conda-forge
2025-05-07T20:24:09.9784594Z Platform: linux-64
2025-05-07T20:24:20.4777208Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:22.1049847Z Solving environment: done
2025-05-07T20:24:22.1650273Z ## Package Plan ##
2025-05-07T20:24:22.1650708Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:22.1651150Z added / updated specs:
2025-05-07T20:24:22.1651413Z - pyopenssl[version='>22.1.0']
2025-05-07T20:24:22.1651738Z The following packages will be downloaded:
2025-05-07T20:24:22.1652070Z package | build
2025-05-07T20:24:22.1652394Z ---------------------------|-----------------
2025-05-07T20:24:22.1652771Z cffi-1.17.1 | py310h8deb56e_0 238 KB conda-forge
2025-05-07T20:24:22.1653216Z cryptography-44.0.3 | py310h6c63255_0 1.5 MB conda-forge
2025-05-07T20:24:22.1653663Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:24:22.1654074Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:24:22.1654490Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:24:22.1654901Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:24:22.1655329Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:24:22.1655760Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:24:22.1656188Z python_abi-3.10 | 2_cp310 4 KB conda-forge
2025-05-07T20:24:22.1656643Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:24:22.1657132Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:24:22.1657551Z ------------------------------------------------------------
2025-05-07T20:24:22.1657899Z Total: 6.3 MB
2025-05-07T20:24:22.1658252Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:22.1658672Z cffi conda-forge/linux-64::cffi-1.17.1-py310h8deb56e_0
2025-05-07T20:24:22.1659484Z cryptography conda-forge/linux-64::cryptography-44.0.3-py310h6c63255_0
2025-05-07T20:24:22.1660059Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:22.1660511Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:22.1660976Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:22.1661433Z python_abi conda-forge/linux-64::python_abi-3.10-2_cp310
2025-05-07T20:24:22.1662245Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:22.1662832Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:22.1663274Z The following packages will be UPDATED:
2025-05-07T20:24:22.1663863Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:22.1664609Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:22.1665252Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:22.1665868Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:22.1666378Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:22.2654982Z cffi-1.17.1 | 238 KB | ########## | 100%
2025-05-07T20:24:22.3009277Z libgomp-15.1.0 | 442 KB | ########## | 100%
2025-05-07T20:24:22.3108693Z libgcc-15.1.0 | 810 KB | ########## | 100%
2025-05-07T20:24:22.3211546Z pyopenssl-25.0.0 | 120 KB | ########## | 100%
2025-05-07T20:24:22.3387403Z pycparser-2.22 | 108 KB | ########## | 100%
2025-05-07T20:24:22.3519594Z cryptography-44.0.3 | 1.5 MB | ########## | 100%
2025-05-07T20:24:22.3714101Z typing-extensions-4. | 88 KB | ########## | 100%
2025-05-07T20:24:22.3777504Z typing_extensions-4. | 51 KB | ########## | 100%
2025-05-07T20:24:22.3896903Z openssl-3.5.0 | 3.0 MB | ########## | 100%
2025-05-07T20:24:22.3912746Z python_abi-3.10 | 4 KB | ########## | 100%
2025-05-07T20:24:22.4077213Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:24:22.7061550Z done
2025-05-07T20:24:22.8062333Z Preparing transaction: done
2025-05-07T20:24:22.9067516Z Verifying transaction: done
2025-05-07T20:24:24.4092299Z Executing transaction: done
2025-05-07T20:24:24.5889415Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:26.3309730Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:26.3322876Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:26.3346050Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:27.1964568Z Channels:
2025-05-07T20:24:27.1964809Z - conda-forge
2025-05-07T20:24:27.1965035Z Platform: linux-64
2025-05-07T20:24:30.5066142Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.8740905Z Solving environment: done
2025-05-07T20:24:30.9347220Z ## Package Plan ##
2025-05-07T20:24:30.9347614Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:30.9348024Z added / updated specs:
2025-05-07T20:24:30.9348272Z - libxcrypt
2025-05-07T20:24:30.9348537Z The following packages will be downloaded:
2025-05-07T20:24:30.9348869Z package | build
2025-05-07T20:24:30.9349208Z ---------------------------|-----------------
2025-05-07T20:24:30.9349589Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:24:30.9349992Z ------------------------------------------------------------
2025-05-07T20:24:30.9350330Z Total: 98 KB
2025-05-07T20:24:30.9350677Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:30.9351130Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:30.9351576Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:31.0848555Z libxcrypt-4.4.36 | 98 KB | ########## | 100%
2025-05-07T20:24:31.0852340Z done
2025-05-07T20:24:31.1857556Z Preparing transaction: done
2025-05-07T20:24:31.2861805Z Verifying transaction: done
2025-05-07T20:24:31.3867020Z Executing transaction: done
2025-05-07T20:24:34.8339173Z [SETUP] Copying over crypt.h ...
2025-05-07T20:24:34.8339981Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.10/crypt.h
2025-05-07T20:24:36.4692656Z [SETUP] Installed Python version: Python 3.10.16
2025-05-07T20:24:36.4693122Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:24:36.4726989Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:36.4727473Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:36.4742166Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:36.4742514Z env:
2025-05-07T20:24:36.4742740Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:36.4743036Z BUILD_ENV: build_binary
2025-05-07T20:24:36.4743276Z BUILD_TARGET: genai
2025-05-07T20:24:36.4743506Z BUILD_VARIANT: cuda
2025-05-07T20:24:36.4743740Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:36.4743994Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:36.4744294Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:36.4744612Z ##[endgroup]
2025-05-07T20:24:36.8180631Z ################################################################################
2025-05-07T20:24:36.8180994Z # Install C/C++ Compilers
2025-05-07T20:24:36.8181228Z #
2025-05-07T20:24:36.8198254Z # [2025-05-07T20:24:36.819Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:36.8198686Z ################################################################################
2025-05-07T20:24:36.8216006Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:36.9188371Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:36.9199457Z [INSTALL] Installing GLIBC (architecture = 64) ...
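(Pinning sysroot_linux-64=2.17 makes the toolchain compile and link against glibc 2.17 headers and stubs even though the AL2023 host runs glibc 2.34, per the __glibc virtual package in the conda info output above. That floor is what keeps the resulting binaries loadable on older, manylinux2014-era systems. The step below reduces to this sketch:)

    # Sketch: pin an old sysroot so built artifacts don't require the host's newer glibc
    conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17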
2025-05-07T20:24:36.9222807Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:37.7899093Z Channels:
2025-05-07T20:24:37.7899354Z - conda-forge
2025-05-07T20:24:37.7899598Z Platform: linux-64
2025-05-07T20:24:41.1291177Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:41.4961487Z Solving environment: done
2025-05-07T20:24:41.5577054Z ## Package Plan ##
2025-05-07T20:24:41.5577465Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:41.5577892Z added / updated specs:
2025-05-07T20:24:41.5578182Z - sysroot_linux-64=2.17
2025-05-07T20:24:41.5578500Z The following packages will be downloaded:
2025-05-07T20:24:41.5578851Z package | build
2025-05-07T20:24:41.5579175Z ---------------------------|-----------------
2025-05-07T20:24:41.5579604Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:24:41.5580401Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:24:41.5581008Z ------------------------------------------------------------
2025-05-07T20:24:41.5581510Z Total: 15.4 MB
2025-05-07T20:24:41.5582009Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:41.5582697Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:41.5583265Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:41.5583727Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:41.8820374Z kernel-headers_linux | 921 KB | ########## | 100%
2025-05-07T20:24:42.0646092Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:24:42.5405689Z done
2025-05-07T20:24:42.6408795Z Preparing transaction: done
2025-05-07T20:24:42.8415329Z Verifying transaction: done
2025-05-07T20:24:43.0485938Z Executing transaction: done
2025-05-07T20:24:43.2043525Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:43.2043890Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:44.8886498Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:24:44.8900184Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
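(conda-forge's gxx_linux-64 metapackage installs a cross-prefixed GCC/G++ whose binaries carry an x86_64-conda-linux-gnu- triplet, and its activation scripts export CC and CXX, so later build steps pick up GCC 11.4.0 rather than the system compiler. A quick way to confirm what a build inside the env will see is sketched below; the triplet naming is the conda-forge convention, not something printed in this log.)

    # Sketch: show the compiler the activated environment exposes
    conda run -n build_binary bash -c 'echo "CC=$CC CXX=$CXX"; $CXX --version | head -1'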
2025-05-07T20:24:44.8924035Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:24:45.7765260Z Channels:
2025-05-07T20:24:45.7765874Z - conda-forge
2025-05-07T20:24:45.7766391Z Platform: linux-64
2025-05-07T20:24:49.0827420Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:50.0377440Z Solving environment: done
2025-05-07T20:24:50.1021060Z ## Package Plan ##
2025-05-07T20:24:50.1021567Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:50.1022164Z added / updated specs:
2025-05-07T20:24:50.1022422Z - gxx_linux-64=11.4.0
2025-05-07T20:24:50.1022780Z The following packages will be downloaded:
2025-05-07T20:24:50.1023121Z package | build
2025-05-07T20:24:50.1023460Z ---------------------------|-----------------
2025-05-07T20:24:50.1023993Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:24:50.1024473Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:24:50.1024942Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:24:50.1025392Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:24:50.1025829Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:24:50.1026380Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:24:50.1026816Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:24:50.1027290Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:24:50.1027758Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:24:50.1028323Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:24:50.1028795Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:24:50.1029264Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge
2025-05-07T20:24:50.1029666Z ------------------------------------------------------------
2025-05-07T20:24:50.1030009Z Total: 91.6 MB
2025-05-07T20:24:50.1030352Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:50.1030833Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:24:50.1031710Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:24:50.1032256Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:24:50.1033065Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:24:50.1033562Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:24:50.1034192Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:24:50.1034713Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:50.1035270Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:24:50.1035752Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:24:50.1036292Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:50.1036767Z The following packages will be UPDATED:
2025-05-07T20:24:50.1037287Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:24:50.1037998Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:24:50.1038557Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:50.4214155Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:24:50.6538646Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:24:50.6722272Z libstdcxx-devel_linu | 11.1 MB | #####7 | 57%
2025-05-07T20:24:50.6820754Z gxx_impl_linux-64-11 | 11.2 MB | #########3 | 94%
2025-05-07T20:24:50.7087971Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:24:50.7218869Z libgcc-devel_linux-6 | 2.3 MB | | 1%
2025-05-07T20:24:50.7352070Z gcc_impl_linux-64-11 | 53.0 MB | #7 | 18%
2025-05-07T20:24:50.7544610Z ld_impl_linux-64-2.4 | 691 KB | 2 | 2%
libstdcxx-devel_linu | 11.1 MB | ########3 | 83%  2025-05-07T20:24:50.7944557Z 2025-05-07T20:24:50.7944561Z 2025-05-07T20:24:50.7944564Z 2025-05-07T20:24:50.7944568Z 2025-05-07T20:24:50.7944572Z 2025-05-07T20:24:50.7944576Z 2025-05-07T20:24:50.7946029Z 2025-05-07T20:24:50.8222439Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%  2025-05-07T20:24:50.8603735Z gcc_impl_linux-64-11 | 53.0 MB | ##2 | 23% 2025-05-07T20:24:50.8603974Z 2025-05-07T20:24:50.8604104Z 2025-05-07T20:24:50.8604113Z 2025-05-07T20:24:50.8604134Z 2025-05-07T20:24:50.8604200Z 2025-05-07T20:24:50.8604301Z 2025-05-07T20:24:50.8604327Z 2025-05-07T20:24:50.8607396Z 2025-05-07T20:24:50.8667571Z libstdcxx-ng-15.1.0 | 34 KB | ####7 | 47%  2025-05-07T20:24:50.8667887Z 2025-05-07T20:24:50.8667891Z 2025-05-07T20:24:50.8667895Z 2025-05-07T20:24:50.8667899Z 2025-05-07T20:24:50.8667903Z 2025-05-07T20:24:50.8667906Z 2025-05-07T20:24:50.8667920Z 2025-05-07T20:24:50.8668008Z 2025-05-07T20:24:50.8931963Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:24:50.8932299Z 2025-05-07T20:24:50.8932305Z 2025-05-07T20:24:50.8932310Z 2025-05-07T20:24:50.8932315Z 2025-05-07T20:24:50.8932619Z 2025-05-07T20:24:50.8934781Z 2025-05-07T20:24:50.8935319Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%  2025-05-07T20:24:50.8935636Z 2025-05-07T20:24:50.8935641Z 2025-05-07T20:24:50.8935656Z 2025-05-07T20:24:50.8935679Z 2025-05-07T20:24:50.8935685Z 2025-05-07T20:24:50.8935691Z 2025-05-07T20:24:50.9133479Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%  2025-05-07T20:24:50.9133823Z 2025-05-07T20:24:50.9133827Z 2025-05-07T20:24:50.9133831Z 2025-05-07T20:24:50.9133835Z 2025-05-07T20:24:50.9133839Z 2025-05-07T20:24:50.9133843Z 2025-05-07T20:24:50.9133846Z 2025-05-07T20:24:50.9133850Z 2025-05-07T20:24:50.9134541Z 2025-05-07T20:24:50.9144936Z gcc_linux-64-11.4.0 | 31 KB | #####2 | 52%  2025-05-07T20:24:50.9145226Z 2025-05-07T20:24:50.9145230Z 2025-05-07T20:24:50.9145234Z 2025-05-07T20:24:50.9147387Z 2025-05-07T20:24:50.9162431Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%  2025-05-07T20:24:50.9162715Z 2025-05-07T20:24:50.9162719Z 2025-05-07T20:24:50.9162723Z 2025-05-07T20:24:50.9164250Z 2025-05-07T20:24:50.9177089Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%  2025-05-07T20:24:50.9177388Z 2025-05-07T20:24:50.9177415Z 2025-05-07T20:24:50.9177419Z 2025-05-07T20:24:50.9177423Z 2025-05-07T20:24:50.9177427Z 2025-05-07T20:24:50.9177435Z 2025-05-07T20:24:50.9177439Z 2025-05-07T20:24:50.9177443Z 2025-05-07T20:24:50.9180469Z 2025-05-07T20:24:50.9224852Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%  2025-05-07T20:24:50.9415658Z gcc_impl_linux-64-11 | 53.0 MB | ##9 | 29% 2025-05-07T20:24:50.9415914Z 2025-05-07T20:24:50.9415918Z 2025-05-07T20:24:50.9415930Z 2025-05-07T20:24:50.9415934Z 2025-05-07T20:24:50.9415937Z 2025-05-07T20:24:50.9415941Z 2025-05-07T20:24:50.9415945Z 2025-05-07T20:24:50.9415950Z 2025-05-07T20:24:50.9415954Z 2025-05-07T20:24:50.9415957Z 2025-05-07T20:24:50.9452703Z gxx_linux-64-11.4.0 | 29 KB | #####5 | 55%  2025-05-07T20:24:50.9453149Z 2025-05-07T20:24:50.9453156Z 2025-05-07T20:24:50.9453161Z 2025-05-07T20:24:50.9453167Z 2025-05-07T20:24:50.9453172Z 2025-05-07T20:24:50.9453178Z 2025-05-07T20:24:50.9453470Z 2025-05-07T20:24:50.9453476Z 2025-05-07T20:24:50.9453479Z 2025-05-07T20:24:50.9453483Z 2025-05-07T20:24:50.9732228Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:24:50.9732546Z 2025-05-07T20:24:50.9732551Z 2025-05-07T20:24:50.9732555Z 2025-05-07T20:24:50.9732558Z 2025-05-07T20:24:50.9732562Z 
2025-05-07T20:24:50.9732566Z 2025-05-07T20:24:50.9732570Z 2025-05-07T20:24:50.9732574Z 2025-05-07T20:24:50.9732578Z 2025-05-07T20:24:50.9732581Z 2025-05-07T20:24:50.9732585Z 2025-05-07T20:24:50.9773470Z binutils_linux-64-2. | 28 KB | #####6 | 56%  2025-05-07T20:24:50.9773787Z 2025-05-07T20:24:50.9773791Z 2025-05-07T20:24:50.9773794Z 2025-05-07T20:24:50.9773798Z 2025-05-07T20:24:50.9773810Z 2025-05-07T20:24:50.9773814Z 2025-05-07T20:24:50.9773818Z 2025-05-07T20:24:50.9773821Z 2025-05-07T20:24:50.9773827Z 2025-05-07T20:24:50.9773831Z 2025-05-07T20:24:50.9776921Z 2025-05-07T20:24:51.0227913Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:24:51.0433242Z gcc_impl_linux-64-11 | 53.0 MB | ###6 | 36% 2025-05-07T20:24:51.0433593Z 2025-05-07T20:24:51.0433597Z 2025-05-07T20:24:51.0433601Z 2025-05-07T20:24:51.0433621Z 2025-05-07T20:24:51.0433624Z 2025-05-07T20:24:51.0433628Z 2025-05-07T20:24:51.0437253Z 2025-05-07T20:24:51.0453548Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%  2025-05-07T20:24:51.0453930Z 2025-05-07T20:24:51.0453934Z 2025-05-07T20:24:51.0453938Z 2025-05-07T20:24:51.0453942Z 2025-05-07T20:24:51.0453945Z 2025-05-07T20:24:51.0453949Z 2025-05-07T20:24:51.0453953Z 2025-05-07T20:24:51.0742919Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%  2025-05-07T20:24:51.0743231Z 2025-05-07T20:24:51.1073777Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:24:51.1074086Z 2025-05-07T20:24:51.1077083Z 2025-05-07T20:24:51.1229256Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:24:51.1436311Z gcc_impl_linux-64-11 | 53.0 MB | ####3 | 44% 2025-05-07T20:24:51.1436623Z 2025-05-07T20:24:51.1436629Z 2025-05-07T20:24:51.1436634Z 2025-05-07T20:24:51.1436639Z 2025-05-07T20:24:51.1436644Z 2025-05-07T20:24:51.1436649Z 2025-05-07T20:24:51.1436655Z 2025-05-07T20:24:51.1436660Z 2025-05-07T20:24:51.1441385Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:24:51.1441914Z 2025-05-07T20:24:51.1441918Z 2025-05-07T20:24:51.1441921Z 2025-05-07T20:24:51.1441925Z 2025-05-07T20:24:51.1441929Z 2025-05-07T20:24:51.1441932Z 2025-05-07T20:24:51.1441936Z 2025-05-07T20:24:51.1442290Z 2025-05-07T20:24:51.1920189Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:24:51.1920515Z 2025-05-07T20:24:51.1920520Z 2025-05-07T20:24:51.1920524Z 2025-05-07T20:24:51.1920528Z 2025-05-07T20:24:51.1920531Z 2025-05-07T20:24:51.2234861Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%  2025-05-07T20:24:51.2600220Z gcc_impl_linux-64-11 | 53.0 MB | #####3 | 54% 2025-05-07T20:24:51.2600630Z 2025-05-07T20:24:51.2600636Z 2025-05-07T20:24:51.2600642Z 2025-05-07T20:24:51.2600647Z 2025-05-07T20:24:51.2600652Z 2025-05-07T20:24:51.2600657Z 2025-05-07T20:24:51.2600663Z 2025-05-07T20:24:51.2600668Z 2025-05-07T20:24:51.2600674Z 2025-05-07T20:24:51.2607523Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%  2025-05-07T20:24:51.2607909Z 2025-05-07T20:24:51.2607913Z 2025-05-07T20:24:51.2607917Z 2025-05-07T20:24:51.2607921Z 2025-05-07T20:24:51.2607925Z 2025-05-07T20:24:51.2607929Z 2025-05-07T20:24:51.2607932Z 2025-05-07T20:24:51.2607936Z 2025-05-07T20:24:51.2607940Z 2025-05-07T20:24:51.3197998Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%  2025-05-07T20:24:51.3198427Z 2025-05-07T20:24:51.3198431Z 2025-05-07T20:24:51.3198435Z 2025-05-07T20:24:51.3198439Z 2025-05-07T20:24:51.3198726Z 2025-05-07T20:24:51.3198731Z 2025-05-07T20:24:51.3198735Z 2025-05-07T20:24:51.3198738Z 2025-05-07T20:24:51.3198742Z 2025-05-07T20:24:51.3198891Z 2025-05-07T20:24:51.3201345Z gxx_linux-64-11.4.0 | 29 KB | 
########## | 100%  2025-05-07T20:24:51.3201731Z 2025-05-07T20:24:51.3201737Z 2025-05-07T20:24:51.3201743Z 2025-05-07T20:24:51.3201748Z 2025-05-07T20:24:51.3201754Z 2025-05-07T20:24:51.3201760Z 2025-05-07T20:24:51.3201765Z 2025-05-07T20:24:51.3201770Z 2025-05-07T20:24:51.3201775Z 2025-05-07T20:24:51.3203718Z 2025-05-07T20:24:51.3235881Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:24:51.3680537Z gcc_impl_linux-64-11 | 53.0 MB | ######3 | 64% 2025-05-07T20:24:51.3680903Z 2025-05-07T20:24:51.3680908Z 2025-05-07T20:24:51.3680915Z 2025-05-07T20:24:51.3680920Z 2025-05-07T20:24:51.3680925Z 2025-05-07T20:24:51.3680929Z 2025-05-07T20:24:51.3833561Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%  2025-05-07T20:24:51.3833948Z 2025-05-07T20:24:51.3833952Z 2025-05-07T20:24:51.3833956Z 2025-05-07T20:24:51.3833969Z 2025-05-07T20:24:51.3833973Z 2025-05-07T20:24:51.3833978Z 2025-05-07T20:24:51.3833982Z 2025-05-07T20:24:51.3833985Z 2025-05-07T20:24:51.3833998Z 2025-05-07T20:24:51.3834001Z 2025-05-07T20:24:51.3835599Z 2025-05-07T20:24:51.3840836Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:24:51.3841134Z 2025-05-07T20:24:51.3841317Z 2025-05-07T20:24:51.3841345Z 2025-05-07T20:24:51.3841352Z 2025-05-07T20:24:51.3841357Z 2025-05-07T20:24:51.3841363Z 2025-05-07T20:24:51.3841369Z 2025-05-07T20:24:51.3841481Z 2025-05-07T20:24:51.3841487Z 2025-05-07T20:24:51.3841499Z 2025-05-07T20:24:51.3841569Z 2025-05-07T20:24:51.4236775Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:24:51.5236973Z gcc_impl_linux-64-11 | 53.0 MB | #######2 | 73% 2025-05-07T20:24:51.6238055Z gcc_impl_linux-64-11 | 53.0 MB | ########1 | 82% 2025-05-07T20:24:51.6969650Z gcc_impl_linux-64-11 | 53.0 MB | #########4 | 94% 2025-05-07T20:24:51.6970053Z 2025-05-07T20:24:51.6970059Z 2025-05-07T20:24:51.6970267Z 2025-05-07T20:24:51.8160054Z binutils_impl_linux- | 6.0 MB | ########## | 100%  2025-05-07T20:24:51.8160348Z 2025-05-07T20:24:51.8313406Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:24:52.1461042Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:24:52.1461355Z 2025-05-07T20:24:52.1461360Z 2025-05-07T20:24:52.5737456Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:24:52.5743905Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:24:52.5744330Z 2025-05-07T20:24:52.5744602Z 2025-05-07T20:24:52.5744893Z  2025-05-07T20:24:52.5745174Z 2025-05-07T20:24:52.5745201Z 2025-05-07T20:24:52.5745464Z  2025-05-07T20:24:52.5745749Z 2025-05-07T20:24:52.5745768Z 2025-05-07T20:24:52.5745773Z 2025-05-07T20:24:52.5745995Z  2025-05-07T20:24:52.5746293Z 2025-05-07T20:24:52.5746298Z 2025-05-07T20:24:52.5746303Z 2025-05-07T20:24:52.5746308Z 2025-05-07T20:24:52.5746555Z  2025-05-07T20:24:52.5746859Z 2025-05-07T20:24:52.5746864Z 2025-05-07T20:24:52.5746869Z 2025-05-07T20:24:52.5746874Z 2025-05-07T20:24:52.5746879Z 2025-05-07T20:24:52.5747117Z  2025-05-07T20:24:52.5747417Z 2025-05-07T20:24:52.5747422Z 2025-05-07T20:24:52.5747427Z 2025-05-07T20:24:52.5747432Z 2025-05-07T20:24:52.5747437Z 2025-05-07T20:24:52.5747442Z 2025-05-07T20:24:52.5747705Z  2025-05-07T20:24:52.5748174Z 2025-05-07T20:24:52.5748178Z 2025-05-07T20:24:52.5748182Z 2025-05-07T20:24:52.5748186Z 2025-05-07T20:24:52.5748321Z 2025-05-07T20:24:52.5748326Z 2025-05-07T20:24:52.5748330Z 2025-05-07T20:24:52.5748546Z  2025-05-07T20:24:52.5748863Z 2025-05-07T20:24:52.5748869Z 2025-05-07T20:24:52.5748874Z 2025-05-07T20:24:52.5748879Z 2025-05-07T20:24:52.5748884Z 2025-05-07T20:24:52.5748889Z 
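[NOTE] The [EXEC] [ATTEMPT 0/3] prefix above indicates the install runs under a retry wrapper. A minimal sketch of what such a wrapper could look like is below; the function name run_with_retries and the attempt count are hypothetical illustrations, not taken from this log:

  # Hypothetical sketch: retry a command up to max_attempts+1 times,
  # echoing each attempt in the same style as the log above.
  run_with_retries () {
    local max_attempts=3
    local attempt
    for attempt in $(seq 0 "${max_attempts}"); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
      if "$@"; then
        return 0  # command succeeded; stop retrying
      fi
      echo "[EXEC] Attempt ${attempt} failed; retrying ..."
    done
    return 1  # all attempts failed
  }

  # Usage, mirroring the command in this log:
  run_with_retries conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0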
2025-05-07T20:24:52.6749611Z Preparing transaction: done
2025-05-07T20:24:52.9756091Z Verifying transaction: done
2025-05-07T20:24:53.0766073Z Executing transaction: done
2025-05-07T20:24:53.2424641Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:57.1506935Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:57.1537639Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:57.1567804Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:57.1597868Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:59.0412759Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:59.1049737Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:00.9854065Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:01.0485249Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:02.9433445Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:03.0070235Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:04.8870958Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:04.9516424Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:04.9521071Z [INFO] Printing out all preprocessor defines in the C compiler ...
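[NOTE] The dump that follows comes from GCC's -dM -E flags: -E stops after preprocessing, and -dM prints every macro defined at that point, so feeding an empty stdin ("-" as the input file) yields exactly the compiler's built-in predefines. A minimal sketch of reproducing this check by hand, assuming the same build_binary conda env as this log:

  # Print all built-in preprocessor defines of the C compiler;
  # stdin is empty, so only the predefined macros appear.
  conda run -n build_binary cc -dM -E - </dev/null | sort

  # Filter for a single macro of interest, e.g. the GCC major version:
  conda run -n build_binary cc -dM -E - </dev/null | grep __GNUC__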
2025-05-07T20:25:04.9522207Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:04.9522780Z 2025-05-07T20:25:06.8398053Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:06.8398477Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:06.8399137Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:06.8399411Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:06.8399750Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:06.8400258Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:06.8400555Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:06.8400861Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:06.8401121Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:06.8401373Z #define __CHAR_BIT__ 8 2025-05-07T20:25:06.8401612Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:06.8401856Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:06.8402113Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:06.8402389Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:06.8402655Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:06.8402952Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8403256Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:06.8403546Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:06.8403875Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:06.8404203Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:06.8404621Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:06.8405023Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:06.8405341Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:06.8405622Z #define __GCC_IEC_559 2 2025-05-07T20:25:06.8405862Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:06.8406135Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:06.8406402Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:06.8406675Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:06.8407009Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8407335Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:06.8407611Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.8407880Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:06.8408149Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:06.8408419Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:06.8408675Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:06.8408941Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:06.8409209Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:06.8409457Z #define __INT8_C(c) c 2025-05-07T20:25:06.8409698Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:06.8409995Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8410302Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:06.8410615Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:06.8410964Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:06.8411232Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:06.8411500Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8411783Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:06.8412065Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:06.8412442Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:06.8412863Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:06.8413150Z #define __linux 1 2025-05-07T20:25:06.8413372Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:06.8413657Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:06.8413936Z #define __unix 1 2025-05-07T20:25:06.8414160Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:06.8414436Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:06.8414712Z #define __WINT_MIN__ 0U 2025-05-07T20:25:06.8414950Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.8415230Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:06.8415503Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:06.8415764Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:06.8416017Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:06.8416299Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:06.8416592Z #define __INT64_C(c) c ## L 2025-05-07T20:25:06.8416853Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:06.8417147Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:06.8417501Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:06.8417848Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:06.8418296Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:06.8418548Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:06.8418801Z #define __DBL_DIG__ 15 2025-05-07T20:25:06.8419033Z #define __FLT32_DIG__ 6 2025-05-07T20:25:06.8419335Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:06.8419677Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:06.8420009Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:06.8420335Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:06.8420675Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:06.8420918Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:06.8421179Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:06.8421553Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:06.8421939Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:06.8422216Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:06.8422471Z #define __unix__ 1 2025-05-07T20:25:06.8422685Z #define __INT_WIDTH__ 32 2025-05-07T20:25:06.8422937Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:06.8423181Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:06.8423425Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:06.8423688Z #define __UINT16_C(c) c 2025-05-07T20:25:06.8423930Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:06.8424175Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:06.8424527Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:06.8424885Z #define __gnu_linux__ 1 2025-05-07T20:25:06.8425125Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:06.8425390Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.8425681Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8425945Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:06.8426198Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:06.8426447Z #define __GNUC__ 11 2025-05-07T20:25:06.8426667Z #define __pie__ 2 2025-05-07T20:25:06.8426884Z #define __MMX__ 1 2025-05-07T20:25:06.8427110Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:06.8427377Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:06.8427645Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:06.8427915Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:06.8428258Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:06.8428644Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8428980Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:06.8429234Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:06.8429499Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:06.8429794Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:06.8430062Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:06.8430317Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:06.8430600Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:06.8440159Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:06.8440474Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:06.8440762Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:06.8441028Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:06.8441312Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:06.8441584Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:06.8441852Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:06.8442103Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:06.8442413Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:06.8442778Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:06.8443053Z #define __SSE2_MATH__ 1 2025-05-07T20:25:06.8443307Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:06.8443604Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8443903Z #define __amd64 1 2025-05-07T20:25:06.8444137Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:06.8444406Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:06.8444702Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:06.8445123Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:06.8445372Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:06.8445652Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:06.8446014Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:06.8446272Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:06.8446534Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:06.8446792Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:06.8447045Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:06.8447319Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:06.8447567Z #define __x86_64 1 2025-05-07T20:25:06.8447797Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:06.8448156Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:06.8448617Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:06.8449068Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:06.8449580Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:06.8449961Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:06.8450206Z #define __LP64__ 1 2025-05-07T20:25:06.8450446Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8450793Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:06.8451161Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:06.8451436Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:06.8451712Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.8451987Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:06.8452264Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:06.8452533Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:06.8452791Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:06.8453046Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:06.8453307Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:06.8453636Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:06.8453982Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:06.8454265Z #define __FLT_DIG__ 6 2025-05-07T20:25:06.8454500Z #define __NO_INLINE__ 1 2025-05-07T20:25:06.8454737Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:06.8455069Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:06.8455421Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:06.8455672Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:06.8455934Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:06.8456193Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:06.8456444Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:06.8456705Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:06.8456997Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:06.8457285Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:06.8457550Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:06.8457857Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:06.8458183Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:06.8458443Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:06.8458708Z #define __FLT128_DIG__ 33 2025-05-07T20:25:06.8458949Z #define __INT32_C(c) c 2025-05-07T20:25:06.8459184Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:06.8459475Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:06.8459757Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:06.8460124Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:06.8460438Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:06.8460740Z #define unix 1 2025-05-07T20:25:06.8460965Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:06.8461274Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8461575Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:06.8461886Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:06.8462206Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:06.8462460Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:06.8462719Z #define __ELF__ 1 2025-05-07T20:25:06.8462944Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:06.8463329Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:06.8463609Z #define __FLT_RADIX__ 2 2025-05-07T20:25:06.8463851Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:06.8464286Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:06.8464648Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:06.8464896Z #define __SSE_MATH__ 1 2025-05-07T20:25:06.8465117Z #define __k8 1 2025-05-07T20:25:06.8465407Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:06.8465771Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:06.8466065Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:06.8466359Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:06.8466613Z #define __LDBL_DIG__ 18 2025-05-07T20:25:06.8466848Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:06.8467104Z #define __x86_64__ 1 2025-05-07T20:25:06.8467346Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:06.8467639Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:06.8467978Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8468287Z #define __FLT64_DIG__ 15 2025-05-07T20:25:06.8468563Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8468920Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:06.8469242Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8469499Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:06.8469776Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8470070Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:06.8470437Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:06.8470826Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:06.8471120Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:06.8471453Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:06.8471772Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:06.8472072Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:06.8472351Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:06.8472654Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:06.8472934Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:06.8473178Z #define __SEG_FS 1 2025-05-07T20:25:06.8473412Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:06.8473691Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:06.8473970Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8474247Z #define __SEG_GS 1 2025-05-07T20:25:06.8474557Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:06.8474936Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:06.8475210Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:06.8475492Z #define __INT16_TYPE__ short int 2025-05-07T20:25:06.8475772Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:06.8476065Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:06.8476325Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:06.8476579Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:06.8476842Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:06.8477219Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:06.8477614Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8477908Z #define linux 1 2025-05-07T20:25:06.8478130Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8478414Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:06.8478685Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:06.8478937Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:06.8479194Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:06.8479453Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:06.8479796Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:06.8480194Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:06.8480519Z #define __code_model_small__ 1 2025-05-07T20:25:06.8480792Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:06.8481065Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:06.8481312Z #define __k8__ 1 2025-05-07T20:25:06.8481635Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:06.8481924Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:06.8482224Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:06.8482547Z #define __pic__ 2 2025-05-07T20:25:06.8482798Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8483113Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:06.8483410Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8483742Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:06.8484102Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:06.8484465Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:06.8484744Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:06.8485034Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:06.8485346Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:06.8485603Z #define __linux__ 1 2025-05-07T20:25:06.8485830Z #define __INT64_TYPE__ long int 2025-05-07T20:25:06.8486101Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:06.8486371Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:06.8486643Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:06.8486902Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:06.8487209Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8487541Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:06.8487880Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:06.8488153Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:06.8488450Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:06.8488740Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:06.8489075Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:06.8489433Z #define __SSE__ 1 2025-05-07T20:25:06.8489659Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:06.8490307Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:06.8490717Z #define __amd64__ 1 2025-05-07T20:25:06.8490938Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:06.8491192Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:06.8491470Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:06.8491737Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:06.8492008Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:06.8492290Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:06.8492551Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:06.8492819Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:06.8493082Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:06.8493430Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:06.8493891Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:06.8494243Z #define _LP64 1 2025-05-07T20:25:06.8494463Z #define __UINT8_C(c) c 2025-05-07T20:25:06.8494706Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:06.8494965Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:06.8495236Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:06.8495512Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:06.8495817Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:06.8496168Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:06.8496630Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:06.8497001Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8497304Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8497612Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:06.8497971Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:06.8498329Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:06.8498594Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:06.8498931Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:06.8499291Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:06.8499548Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:06.8499875Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:06.8500120Z #define __FXSR__ 1 2025-05-07T20:25:06.8500619Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:06.8501074Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:06.8501602Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:06.8501900Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:06.8502158Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:06.8502490Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:06.8502836Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:06.8503081Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:06.8503322Z #define __PIC__ 2 2025-05-07T20:25:06.8503567Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:06.8503959Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:06.8504340Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:06.8504666Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:06.8504993Z #define __SSE2__ 1 2025-05-07T20:25:06.8505217Z #define __INT32_TYPE__ int 2025-05-07T20:25:06.8505466Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:06.8505714Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:06.8506054Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:06.8506405Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:06.8506670Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:06.8506939Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:06.8507206Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8507473Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:06.8507720Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:06.8507967Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:06.8508249Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8508545Z #define __PIE__ 2 2025-05-07T20:25:06.8508867Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:06.8509244Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:06.8509599Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:06.8509964Z #define __INT16_C(c) c 2025-05-07T20:25:06.8510190Z #define __STDC__ 1 2025-05-07T20:25:06.8510421Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:06.8510699Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:06.8510956Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.8511251Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:06.8511596Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:06.8511925Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:06.8512189Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.8512470Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:06.8512737Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:06.8513013Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:06.8513300Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8513572Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:06.8513862Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8514252Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:06.8514623Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:06.8514922Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:06.8515212Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:06.8515461Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:06.8515618Z 2025-05-07T20:25:06.9032776Z 2025-05-07T20:25:06.9033106Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
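[NOTE] The C++ dump below adds -x c++ so the driver treats stdin as C++ rather than C; that is what surfaces __cplusplus and the __cpp_* feature-test macros absent from the C dump above. A sketch of the same check, plus a quick way to confirm the default language standard (the dump below shows __cplusplus 201703L, i.e. GCC 11 defaulting to C++17):

  # Dump C++ predefines; -x c++ forces the language since stdin has no file extension.
  conda run -n build_binary c++ -dM -E -x c++ - </dev/null | grep -E '__cplusplus|__cpp_'

  # Expected with no -std flag on this toolchain: #define __cplusplus 201703L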
2025-05-07T20:25:06.9033542Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:06.9033772Z 2025-05-07T20:25:08.7949659Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:08.7950255Z #define __cpp_attributes 200809L 2025-05-07T20:25:08.7950733Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:08.7951239Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:08.7951661Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:08.7952032Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:08.7952882Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:08.7953241Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:08.7953525Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:08.7953981Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:08.7954288Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:08.7954554Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:08.7954806Z #define __CHAR_BIT__ 8 2025-05-07T20:25:08.7955045Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:08.7955291Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:08.7955538Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:08.7955813Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:08.7956094Z #define __cpp_static_assert 201411L 2025-05-07T20:25:08.7956374Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:08.7956676Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.7956976Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:08.7957265Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:08.7957589Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:08.7957966Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:08.7958368Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:08.7958777Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:08.7959090Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:08.7959370Z #define __GCC_IEC_559 2 2025-05-07T20:25:08.7959608Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:08.7959884Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:08.7960161Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:08.7960441Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:08.7960734Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:08.7961054Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:08.7961363Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:08.7961690Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.7962011Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:08.7962291Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:08.7962561Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:08.7962839Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:08.7963145Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:08.7963403Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:08.7963666Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:08.7963943Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:08.7964265Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:08.7964598Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:08.7964855Z #define __INT8_C(c) c 2025-05-07T20:25:08.7965093Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:08.7965369Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:08.7965690Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.7966012Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:08.7966280Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:08.7966570Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:08.7966889Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:08.7967234Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:08.7967524Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:08.7967801Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.7968061Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.7968339Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:08.7968614Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:08.7968993Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:08.7969396Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:08.7969680Z #define __linux 1 2025-05-07T20:25:08.7969909Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:08.7970183Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:08.7970462Z #define __unix 1 2025-05-07T20:25:08.7970685Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:08.7970963Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:08.7971340Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:08.7971614Z #define __WINT_MIN__ 0U 2025-05-07T20:25:08.7971853Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.7972209Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:08.7972482Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:08.7972744Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:08.7972994Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:08.7973274Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:08.7973561Z #define __INT64_C(c) c ## L 2025-05-07T20:25:08.7973824Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:08.7974118Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:08.7974389Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:08.7974683Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:08.7974958Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:08.7975219Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:08.7975563Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:08.7975939Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:08.7976193Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:08.7976462Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:08.7976748Z #define __DBL_DIG__ 15 2025-05-07T20:25:08.7976978Z #define __FLT32_DIG__ 6 2025-05-07T20:25:08.7977277Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:08.7977625Z #define __GXX_WEAK__ 1 2025-05-07T20:25:08.7977860Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:08.7978101Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:08.7978420Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:08.7978763Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.7979041Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:08.7979341Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:08.7979671Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:08.7980167Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:08.7980565Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:08.7980839Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:08.7981097Z #define __unix__ 1 2025-05-07T20:25:08.7981320Z #define __INT_WIDTH__ 32 2025-05-07T20:25:08.7981574Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:08.7981821Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:08.7982068Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:08.7982333Z #define __UINT16_C(c) c 2025-05-07T20:25:08.7982568Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:08.7982819Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:08.7983180Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:08.7992283Z #define __gnu_linux__ 1 2025-05-07T20:25:08.7992559Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:08.7992838Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:08.7993133Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.7993427Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.7993693Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:08.7993972Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:08.7994228Z #define __GNUC__ 11 2025-05-07T20:25:08.7994444Z #define __GXX_RTTI 1 2025-05-07T20:25:08.7994680Z #define __pie__ 2 2025-05-07T20:25:08.7994905Z #define __MMX__ 1 2025-05-07T20:25:08.7995127Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:08.7995408Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:08.7995697Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:08.7995963Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:08.7996223Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:08.7996530Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:08.7996848Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:08.7997200Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:08.7997580Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:08.7997895Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.7998207Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:08.8000256Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:08.8000541Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:08.8000849Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:08.8001272Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:08.8001548Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:08.8001805Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:08.8002097Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:08.8002395Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:08.8002663Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:08.8002942Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:08.8003204Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:08.8003473Z #define __cplusplus 201703L 2025-05-07T20:25:08.8003747Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:08.8004035Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:08.8004294Z #define __DEPRECATED 1 2025-05-07T20:25:08.8004542Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:08.8004838Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:08.8005109Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:08.8005425Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:08.8005795Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:08.8006130Z #define __SSE2_MATH__ 1 2025-05-07T20:25:08.8006372Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:08.8006679Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.8006971Z #define __amd64 1 2025-05-07T20:25:08.8007192Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:08.8007467Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:08.8007776Z #define __GNUG__ 11 2025-05-07T20:25:08.8008042Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:08.8008361Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:08.8008628Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:08.8008896Z #define __FLT64X_MIN_EXP__ (-16381) 
[... several hundred predefined-macro lines from the compiler's `-dM -E` probe elided; notable entries: __VERSION__ "11.4.0", __GNUC_MINOR__ 4, __GNUC_PATCHLEVEL__ 0, __x86_64__ 1, __linux__ 1, __ELF__ 1, __LP64__ 1, __SSE2__ 1 ...]
2025-05-07T20:25:08.8602589Z + conda run -n build_binary c++ --version
2025-05-07T20:25:10.7478011Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:10.7478561Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:10.7479035Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:10.7479567Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:10.8105197Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:10.8105848Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:12.7594933Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:12.7597899Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:12.7598685Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:14.7069644Z #define __cplusplus 201703L
2025-05-07T20:25:14.7072935Z [INSTALL] Successfully installed C/C++ compilers
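The two probes above read the compiler's default language standards straight from its predefined macros; 201710L corresponds to C17 and 201703L to C++17. A minimal standalone sketch of the same check, assuming only a GCC-compatible cc/c++ on PATH (no conda env required):

    # Ask the compilers for their default language standards by dumping
    # predefined macros from an empty translation unit, as the log does.
    cc  -dM -E -        < /dev/null | grep __STDC_VERSION__   # 201710L -> C17
    c++ -dM -E -x c++ - < /dev/null | grep __cplusplus        # 201703L -> C++17
    # Appending a flag such as -std=c++20 to the c++ line shows how
    # build flags shift the reported default.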
2025-05-07T20:25:14.7119482Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:14.7119931Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:14.7132295Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:14.7132642Z env:
2025-05-07T20:25:14.7132862Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:14.7133162Z BUILD_ENV: build_binary
2025-05-07T20:25:14.7133402Z BUILD_TARGET: genai
2025-05-07T20:25:14.7133622Z BUILD_VARIANT: cuda
2025-05-07T20:25:14.7133852Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:14.7134107Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:14.7134397Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:14.7134722Z ##[endgroup]
2025-05-07T20:25:15.0522579Z ################################################################################
2025-05-07T20:25:15.0522966Z # Install CUDA
2025-05-07T20:25:15.0523184Z #
2025-05-07T20:25:15.0539011Z # [2025-05-07T20:25:15.053Z] + install_cuda build_binary 12.6.3
2025-05-07T20:25:15.0539467Z ################################################################################
2025-05-07T20:25:15.0556415Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:15.1433431Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:15.1433793Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:15.1438637Z + conda clean --packages --tarball -y
2025-05-07T20:25:15.8560382Z Will remove 32 (142.2 MB) tarball(s).
2025-05-07T20:25:15.8560792Z Will remove 6 (617 KB) package(s).
2025-05-07T20:25:15.9207936Z + conda clean --all -y
2025-05-07T20:25:16.5927925Z There are no unused tarball(s) to remove.
2025-05-07T20:25:16.5928344Z Will remove 1 index cache(s).
2025-05-07T20:25:16.5928668Z There are no unused package(s) to remove.
2025-05-07T20:25:16.5928984Z There are no tempfile(s) to remove.
2025-05-07T20:25:16.5929280Z There are no logfile(s) to remove.
2025-05-07T20:25:16.6589476Z [INSTALL] Installing CUDA 12.6.3 ...
2025-05-07T20:25:16.6614215Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3
2025-05-07T20:25:17.5742966Z Channels:
2025-05-07T20:25:17.5743223Z  - conda-forge
2025-05-07T20:25:17.5743452Z Platform: linux-64
2025-05-07T20:25:28.0713914Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:29.1633966Z Solving environment: done
2025-05-07T20:25:29.2373378Z ## Package Plan ##
2025-05-07T20:25:29.2373921Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:29.2374495Z added / updated specs:
2025-05-07T20:25:29.2374836Z - cuda=12.6.3
2025-05-07T20:25:29.2375196Z The following packages will be downloaded:
2025-05-07T20:25:29.2375667Z     package                    |            build
2025-05-07T20:25:29.2376051Z     ---------------------------|-----------------
[... ~120 package rows elided; the largest downloads: nsight-compute-2024.3.2.3 (443.1 MB), libcublas-12.6.4.1 (256.2 MB), libcufft-11.3.0.4 (156.2 MB), libcusparse-12.5.4.2 (118.6 MB), cuda-nsight-12.6.77 (113.2 MB), cuda-nvvp-12.6.80 (109.3 MB), libcusolver-11.7.1.2 (95.8 MB), libnpp-12.3.1.54 (93.4 MB) ...]
2025-05-07T20:25:29.2442121Z     ------------------------------------------------------------
2025-05-07T20:25:29.2442463Z                                            Total:        1.63 GB
2025-05-07T20:25:29.2442807Z The following NEW packages will be INSTALLED:
[... the same ~120 packages, all from conda-forge, elided; they comprise the CUDA 12.6 toolkit, compiler, and library packages plus their X11/fontconfig dependencies ...]
2025-05-07T20:25:29.2526401Z The following packages will be UPDATED:
2025-05-07T20:25:29.2526877Z   libuuid  pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:29.2527477Z   zlib     pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:29.2528023Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:29.2528628Z   python   pkgs/main::python-3.10.16-he870216_1 --> conda-forge::python-3.10.13-hd12c33a_1_cpython
2025-05-07T20:25:29.2529250Z   sqlite   pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:29.2529818Z   tk       pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
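The [EXEC] [ATTEMPT 0/3] prefix above comes from a retry wrapper in .github/scripts/setup_env.bash; its exact implementation is not shown in this log, but a minimal sketch of the pattern (the helper name exec_with_retries and the backoff are assumptions, not the script's actual code) looks like:

    # Minimal sketch of a 3-attempt retry wrapper with linear backoff
    # (hypothetical; the real helper in setup_env.bash may differ).
    exec_with_retries () {
      local max=3 attempt=0
      while (( attempt < max )); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0           # success: stop retrying
        attempt=$(( attempt + 1 ))
        sleep $(( attempt * 10 ))  # back off a little more each attempt
      done
      return 1                     # all attempts failed
    }

    # Usage, mirroring the install command in the log:
    #   exec_with_retries conda install --force-reinstall -n build_binary \
    #     -c conda-forge --override-channels -y cuda=12.6.3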
2025-05-07T20:25:29.2530331Z Downloading and Extracting Packages: ...working...
[... interleaved per-package download progress bars elided ...]
| 443.1 MB | ##1 | 22% 2025-05-07T20:25:31.9628236Z 2025-05-07T20:25:31.9628241Z 2025-05-07T20:25:31.9758354Z libcufft-11.3.0.4 | 156.2 MB | ######2 | 62%  2025-05-07T20:25:31.9759015Z 2025-05-07T20:25:31.9759022Z 2025-05-07T20:25:31.9759027Z 2025-05-07T20:25:31.9759033Z 2025-05-07T20:25:31.9937190Z cuda-nsight-12.6.77 | 113.2 MB | #######4 | 75%  2025-05-07T20:25:31.9937513Z 2025-05-07T20:25:32.0089551Z libcublas-12.6.4.1 | 256.2 MB | ###4 | 34%  2025-05-07T20:25:32.0090034Z 2025-05-07T20:25:32.0090040Z 2025-05-07T20:25:32.0090482Z 2025-05-07T20:25:32.0518908Z libcusparse-12.5.4.2 | 118.6 MB | ########2 | 83%  2025-05-07T20:25:32.0727189Z nsight-compute-2024. | 443.1 MB | ##2 | 23% 2025-05-07T20:25:32.0727506Z 2025-05-07T20:25:32.0727511Z 2025-05-07T20:25:32.0761157Z libcufft-11.3.0.4 | 156.2 MB | ######4 | 65%  2025-05-07T20:25:32.0761420Z 2025-05-07T20:25:32.0761424Z 2025-05-07T20:25:32.0761428Z 2025-05-07T20:25:32.0762110Z 2025-05-07T20:25:32.0937932Z cuda-nsight-12.6.77 | 113.2 MB | #######7 | 78%  2025-05-07T20:25:32.0938245Z 2025-05-07T20:25:32.1145385Z libcublas-12.6.4.1 | 256.2 MB | ###5 | 35%  2025-05-07T20:25:32.1145654Z 2025-05-07T20:25:32.1145658Z 2025-05-07T20:25:32.1145943Z 2025-05-07T20:25:32.1519463Z libcusparse-12.5.4.2 | 118.6 MB | ########5 | 86%  2025-05-07T20:25:32.1752712Z nsight-compute-2024. | 443.1 MB | ##3 | 24% 2025-05-07T20:25:32.1753102Z 2025-05-07T20:25:32.1754437Z 2025-05-07T20:25:32.1762417Z libcufft-11.3.0.4 | 156.2 MB | ######7 | 67%  2025-05-07T20:25:32.1762685Z 2025-05-07T20:25:32.1762689Z 2025-05-07T20:25:32.1762693Z 2025-05-07T20:25:32.1764033Z 2025-05-07T20:25:32.1940893Z cuda-nsight-12.6.77 | 113.2 MB | ########1 | 81%  2025-05-07T20:25:32.1941255Z 2025-05-07T20:25:32.2146324Z libcublas-12.6.4.1 | 256.2 MB | ###6 | 37%  2025-05-07T20:25:32.2146628Z 2025-05-07T20:25:32.2146639Z 2025-05-07T20:25:32.2147182Z 2025-05-07T20:25:32.2535193Z libcusparse-12.5.4.2 | 118.6 MB | ########8 | 89%  2025-05-07T20:25:32.2765382Z nsight-compute-2024. | 443.1 MB | ##4 | 24% 2025-05-07T20:25:32.2765649Z 2025-05-07T20:25:32.2765654Z 2025-05-07T20:25:32.2765658Z 2025-05-07T20:25:32.2765677Z 2025-05-07T20:25:32.2825257Z cuda-nsight-12.6.77 | 113.2 MB | ########4 | 84%  2025-05-07T20:25:32.2825572Z 2025-05-07T20:25:32.2827185Z 2025-05-07T20:25:32.3033340Z libcufft-11.3.0.4 | 156.2 MB | ######9 | 69%  2025-05-07T20:25:32.3033656Z 2025-05-07T20:25:32.3187052Z libcublas-12.6.4.1 | 256.2 MB | ###8 | 38%  2025-05-07T20:25:32.3187450Z 2025-05-07T20:25:32.3187456Z 2025-05-07T20:25:32.3192402Z 2025-05-07T20:25:32.3599511Z libcusparse-12.5.4.2 | 118.6 MB | #########1 | 92%  2025-05-07T20:25:32.3766990Z nsight-compute-2024. | 443.1 MB | ##5 | 25% 2025-05-07T20:25:32.3767291Z 2025-05-07T20:25:32.3767297Z 2025-05-07T20:25:32.3767302Z 2025-05-07T20:25:32.3769664Z 2025-05-07T20:25:32.3847675Z cuda-nsight-12.6.77 | 113.2 MB | ########7 | 87%  2025-05-07T20:25:32.3847988Z 2025-05-07T20:25:32.3850579Z 2025-05-07T20:25:32.4033991Z libcufft-11.3.0.4 | 156.2 MB | #######1 | 72%  2025-05-07T20:25:32.4034281Z 2025-05-07T20:25:32.4187588Z libcublas-12.6.4.1 | 256.2 MB | ###9 | 39%  2025-05-07T20:25:32.4187857Z 2025-05-07T20:25:32.4187861Z 2025-05-07T20:25:32.4193330Z 2025-05-07T20:25:32.4768823Z libcusparse-12.5.4.2 | 118.6 MB | #########4 | 95%  2025-05-07T20:25:32.4769118Z 2025-05-07T20:25:32.4769132Z 2025-05-07T20:25:32.4769136Z 2025-05-07T20:25:32.4770488Z 2025-05-07T20:25:32.4791568Z cuda-nsight-12.6.77 | 113.2 MB | ######### | 91%  2025-05-07T20:25:32.4850457Z nsight-compute-2024. 
| 443.1 MB | ##6 | 26% 2025-05-07T20:25:32.4850828Z 2025-05-07T20:25:32.4853633Z 2025-05-07T20:25:32.5034562Z libcufft-11.3.0.4 | 156.2 MB | #######4 | 74%  2025-05-07T20:25:32.5034937Z 2025-05-07T20:25:32.5189066Z libcublas-12.6.4.1 | 256.2 MB | #### | 41%  2025-05-07T20:25:32.5189606Z 2025-05-07T20:25:32.5189610Z 2025-05-07T20:25:32.5191808Z 2025-05-07T20:25:32.5793263Z libcusparse-12.5.4.2 | 118.6 MB | #########8 | 98%  2025-05-07T20:25:32.5798107Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:32.5798352Z 2025-05-07T20:25:32.5798725Z 2025-05-07T20:25:32.5798729Z 2025-05-07T20:25:32.5798821Z 2025-05-07T20:25:32.5852636Z cuda-nsight-12.6.77 | 113.2 MB | #########3 | 94%  2025-05-07T20:25:32.5852922Z 2025-05-07T20:25:32.5854141Z 2025-05-07T20:25:32.6038332Z libcufft-11.3.0.4 | 156.2 MB | #######6 | 76%  2025-05-07T20:25:32.6038587Z 2025-05-07T20:25:32.6795088Z libcublas-12.6.4.1 | 256.2 MB | ####2 | 42%  2025-05-07T20:25:32.6854750Z nsight-compute-2024. | 443.1 MB | ##7 | 28% 2025-05-07T20:25:32.6855025Z 2025-05-07T20:25:32.6856467Z 2025-05-07T20:25:32.6892897Z libcufft-11.3.0.4 | 156.2 MB | #######8 | 79%  2025-05-07T20:25:32.6893167Z 2025-05-07T20:25:32.6893193Z 2025-05-07T20:25:32.6893196Z 2025-05-07T20:25:32.6893200Z 2025-05-07T20:25:32.7041202Z cuda-nsight-12.6.77 | 113.2 MB | #########6 | 97%  2025-05-07T20:25:32.7041633Z 2025-05-07T20:25:32.7798643Z libcublas-12.6.4.1 | 256.2 MB | ####3 | 44%  2025-05-07T20:25:32.7858406Z nsight-compute-2024. | 443.1 MB | ##8 | 29% 2025-05-07T20:25:32.7858770Z 2025-05-07T20:25:32.7860232Z 2025-05-07T20:25:32.8042753Z libcufft-11.3.0.4 | 156.2 MB | ########1 | 81%  2025-05-07T20:25:32.8045541Z 2025-05-07T20:25:32.8805801Z libcublas-12.6.4.1 | 256.2 MB | ####5 | 45%  2025-05-07T20:25:32.8862602Z nsight-compute-2024. | 443.1 MB | ##9 | 30% 2025-05-07T20:25:32.8862849Z 2025-05-07T20:25:32.8863152Z 2025-05-07T20:25:32.9044102Z libcufft-11.3.0.4 | 156.2 MB | ########4 | 84%  2025-05-07T20:25:32.9044661Z 2025-05-07T20:25:32.9808911Z libcublas-12.6.4.1 | 256.2 MB | ####6 | 47%  2025-05-07T20:25:32.9950301Z nsight-compute-2024. | 443.1 MB | ### | 31% 2025-05-07T20:25:32.9950587Z 2025-05-07T20:25:32.9951934Z 2025-05-07T20:25:33.0045299Z libcufft-11.3.0.4 | 156.2 MB | ########6 | 87%  2025-05-07T20:25:33.0048578Z 2025-05-07T20:25:33.0817305Z libcublas-12.6.4.1 | 256.2 MB | ####8 | 49%  2025-05-07T20:25:33.0950944Z nsight-compute-2024. | 443.1 MB | ###1 | 32% 2025-05-07T20:25:33.0951290Z 2025-05-07T20:25:33.0953153Z 2025-05-07T20:25:33.1048450Z libcufft-11.3.0.4 | 156.2 MB | ########9 | 89%  2025-05-07T20:25:33.1052538Z 2025-05-07T20:25:33.1819917Z libcublas-12.6.4.1 | 256.2 MB | ##### | 50%  2025-05-07T20:25:33.1953112Z nsight-compute-2024. | 443.1 MB | ###2 | 33% 2025-05-07T20:25:33.1953472Z 2025-05-07T20:25:33.1955627Z 2025-05-07T20:25:33.2053383Z libcufft-11.3.0.4 | 156.2 MB | #########2 | 92%  2025-05-07T20:25:33.2057449Z 2025-05-07T20:25:33.2820279Z libcublas-12.6.4.1 | 256.2 MB | #####1 | 52%  2025-05-07T20:25:33.2956248Z nsight-compute-2024. | 443.1 MB | ###3 | 34% 2025-05-07T20:25:33.2956629Z 2025-05-07T20:25:33.2958537Z 2025-05-07T20:25:33.3055607Z libcufft-11.3.0.4 | 156.2 MB | #########4 | 95%  2025-05-07T20:25:33.3057931Z 2025-05-07T20:25:33.3824324Z libcublas-12.6.4.1 | 256.2 MB | #####3 | 53%  2025-05-07T20:25:33.3958257Z nsight-compute-2024. 
| 443.1 MB | ###4 | 34% 2025-05-07T20:25:33.3958604Z 2025-05-07T20:25:33.3960365Z 2025-05-07T20:25:33.4057535Z libcufft-11.3.0.4 | 156.2 MB | #########7 | 97%  2025-05-07T20:25:33.4058633Z 2025-05-07T20:25:33.4828945Z libcublas-12.6.4.1 | 256.2 MB | #####4 | 55%  2025-05-07T20:25:33.5137899Z nsight-compute-2024. | 443.1 MB | ###5 | 35% 2025-05-07T20:25:33.5138276Z 2025-05-07T20:25:33.5832580Z libcublas-12.6.4.1 | 256.2 MB | #####6 | 57%  2025-05-07T20:25:33.6138442Z nsight-compute-2024. | 443.1 MB | ###6 | 36% 2025-05-07T20:25:33.6138794Z 2025-05-07T20:25:33.6832837Z libcublas-12.6.4.1 | 256.2 MB | #####8 | 58%  2025-05-07T20:25:33.7140163Z nsight-compute-2024. | 443.1 MB | ###7 | 37% 2025-05-07T20:25:33.7140627Z 2025-05-07T20:25:33.7840599Z libcublas-12.6.4.1 | 256.2 MB | ###### | 60%  2025-05-07T20:25:33.8142229Z nsight-compute-2024. | 443.1 MB | ###8 | 38% 2025-05-07T20:25:33.8144538Z 2025-05-07T20:25:33.8842988Z libcublas-12.6.4.1 | 256.2 MB | ######2 | 62%  2025-05-07T20:25:33.9143219Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:33.9143812Z 2025-05-07T20:25:33.9849584Z libcublas-12.6.4.1 | 256.2 MB | ######4 | 64%  2025-05-07T20:25:34.0143536Z nsight-compute-2024. | 443.1 MB | #### | 40% 2025-05-07T20:25:34.0145239Z 2025-05-07T20:25:34.0853590Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 66%  2025-05-07T20:25:34.1181275Z nsight-compute-2024. | 443.1 MB | ####1 | 41% 2025-05-07T20:25:34.1181636Z 2025-05-07T20:25:34.1857429Z libcublas-12.6.4.1 | 256.2 MB | ######7 | 68%  2025-05-07T20:25:34.2182199Z nsight-compute-2024. | 443.1 MB | ####2 | 43% 2025-05-07T20:25:34.2184517Z 2025-05-07T20:25:34.2858922Z libcublas-12.6.4.1 | 256.2 MB | ######9 | 69%  2025-05-07T20:25:34.3188310Z nsight-compute-2024. | 443.1 MB | ####3 | 44% 2025-05-07T20:25:34.3188754Z 2025-05-07T20:25:34.3861615Z libcublas-12.6.4.1 | 256.2 MB | #######1 | 71%  2025-05-07T20:25:34.4192383Z nsight-compute-2024. | 443.1 MB | ####4 | 45% 2025-05-07T20:25:34.4195165Z 2025-05-07T20:25:34.4865501Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 73%  2025-05-07T20:25:34.5213327Z nsight-compute-2024. | 443.1 MB | ####5 | 46% 2025-05-07T20:25:34.5213676Z 2025-05-07T20:25:34.5955680Z libcublas-12.6.4.1 | 256.2 MB | #######4 | 75%  2025-05-07T20:25:34.6214464Z nsight-compute-2024. | 443.1 MB | ####7 | 47% 2025-05-07T20:25:34.6214818Z 2025-05-07T20:25:34.7216631Z libcublas-12.6.4.1 | 256.2 MB | #######6 | 77%  2025-05-07T20:25:34.7217022Z 2025-05-07T20:25:34.7807100Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 79%  2025-05-07T20:25:34.8219088Z nsight-compute-2024. | 443.1 MB | ####8 | 48% 2025-05-07T20:25:34.8219434Z 2025-05-07T20:25:34.8911308Z libcublas-12.6.4.1 | 256.2 MB | ########1 | 81%  2025-05-07T20:25:34.9200199Z nsight-compute-2024. | 443.1 MB | ####9 | 49% 2025-05-07T20:25:34.9200546Z 2025-05-07T20:25:34.9200668Z 2025-05-07T20:25:34.9200675Z 2025-05-07T20:25:34.9200732Z 2025-05-07T20:25:34.9201234Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:34.9201605Z 2025-05-07T20:25:34.9201611Z 2025-05-07T20:25:34.9201635Z 2025-05-07T20:25:34.9201641Z 2025-05-07T20:25:34.9224202Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:34.9224580Z 2025-05-07T20:25:34.9899019Z libcublas-12.6.4.1 | 256.2 MB | ########3 | 83%  2025-05-07T20:25:34.9899379Z 2025-05-07T20:25:34.9899385Z 2025-05-07T20:25:34.9899423Z 2025-05-07T20:25:34.9899463Z 2025-05-07T20:25:34.9899500Z 2025-05-07T20:25:34.9912093Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:35.0610697Z nsight-compute-2024. 
| 443.1 MB | ####9 | 50% 2025-05-07T20:25:35.0614153Z 2025-05-07T20:25:35.0899066Z libcublas-12.6.4.1 | 256.2 MB | ########5 | 86%  2025-05-07T20:25:35.0899437Z 2025-05-07T20:25:35.0899714Z 2025-05-07T20:25:35.0899721Z 2025-05-07T20:25:35.0899725Z 2025-05-07T20:25:35.0899775Z 2025-05-07T20:25:35.0994465Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 3%  2025-05-07T20:25:35.1688111Z nsight-compute-2024. | 443.1 MB | ##### | 51% 2025-05-07T20:25:35.1688468Z 2025-05-07T20:25:35.1688479Z 2025-05-07T20:25:35.1692209Z 2025-05-07T20:25:35.1900616Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:35.1901008Z 2025-05-07T20:25:35.1901013Z 2025-05-07T20:25:35.1901019Z 2025-05-07T20:25:35.1901024Z 2025-05-07T20:25:35.1902673Z 2025-05-07T20:25:35.1994888Z cuda-nvvp-12.6.80 | 109.3 MB | 6 | 7%  2025-05-07T20:25:35.1995885Z 2025-05-07T20:25:35.2042480Z libcublas-12.6.4.1 | 256.2 MB | ########7 | 87%  2025-05-07T20:25:35.2158622Z nsight-compute-2024. | 443.1 MB | #####1 | 52% 2025-05-07T20:25:35.2158993Z 2025-05-07T20:25:35.2158999Z 2025-05-07T20:25:35.2159004Z 2025-05-07T20:25:35.2159009Z 2025-05-07T20:25:35.2159014Z 2025-05-07T20:25:35.2161887Z 2025-05-07T20:25:35.2983303Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:35.2983715Z 2025-05-07T20:25:35.2983720Z 2025-05-07T20:25:35.2983725Z 2025-05-07T20:25:35.2983730Z 2025-05-07T20:25:35.2983734Z 2025-05-07T20:25:35.3165471Z cuda-nvvp-12.6.80 | 109.3 MB | 9 | 9%  2025-05-07T20:25:35.3165856Z 2025-05-07T20:25:35.3165861Z 2025-05-07T20:25:35.3165866Z 2025-05-07T20:25:35.3165871Z 2025-05-07T20:25:35.3165876Z 2025-05-07T20:25:35.3170113Z 2025-05-07T20:25:35.3408736Z libcusolver-11.7.1.2 | 95.8 MB | 2 | 3%  2025-05-07T20:25:35.3484745Z nsight-compute-2024. | 443.1 MB | #####2 | 53% 2025-05-07T20:25:35.3485118Z 2025-05-07T20:25:35.3985714Z libcublas-12.6.4.1 | 256.2 MB | ########9 | 89%  2025-05-07T20:25:35.3986071Z 2025-05-07T20:25:35.3986077Z 2025-05-07T20:25:35.3986082Z 2025-05-07T20:25:35.3986096Z 2025-05-07T20:25:35.3987461Z 2025-05-07T20:25:35.4173963Z cuda-nvvp-12.6.80 | 109.3 MB | #2 | 12%  2025-05-07T20:25:35.4174340Z 2025-05-07T20:25:35.4174351Z 2025-05-07T20:25:35.4174356Z 2025-05-07T20:25:35.4174373Z 2025-05-07T20:25:35.4174378Z 2025-05-07T20:25:35.4176817Z 2025-05-07T20:25:35.4720260Z libcusolver-11.7.1.2 | 95.8 MB | 5 | 5%  2025-05-07T20:25:35.4818358Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:25:35.4818718Z 2025-05-07T20:25:35.5107628Z libcublas-12.6.4.1 | 256.2 MB | ######### | 91%  2025-05-07T20:25:35.5107974Z 2025-05-07T20:25:35.5108005Z 2025-05-07T20:25:35.5108011Z 2025-05-07T20:25:35.5108016Z 2025-05-07T20:25:35.5108026Z 2025-05-07T20:25:35.5175944Z cuda-nvvp-12.6.80 | 109.3 MB | #4 | 15%  2025-05-07T20:25:35.5176330Z 2025-05-07T20:25:35.5176336Z 2025-05-07T20:25:35.5176342Z 2025-05-07T20:25:35.5176347Z 2025-05-07T20:25:35.5176352Z 2025-05-07T20:25:35.5179541Z 2025-05-07T20:25:35.5950859Z libcusolver-11.7.1.2 | 95.8 MB | 7 | 8%  2025-05-07T20:25:35.5953619Z 2025-05-07T20:25:35.5993403Z libcublas-12.6.4.1 | 256.2 MB | #########2 | 92%  2025-05-07T20:25:35.6107837Z nsight-compute-2024. 
| 443.1 MB | #####4 | 54% 2025-05-07T20:25:35.6108203Z 2025-05-07T20:25:35.6108209Z 2025-05-07T20:25:35.6108214Z 2025-05-07T20:25:35.6108219Z 2025-05-07T20:25:35.6108224Z 2025-05-07T20:25:35.6177331Z cuda-nvvp-12.6.80 | 109.3 MB | #7 | 17%  2025-05-07T20:25:35.6177677Z 2025-05-07T20:25:35.6177681Z 2025-05-07T20:25:35.6177703Z 2025-05-07T20:25:35.6177708Z 2025-05-07T20:25:35.6177712Z 2025-05-07T20:25:35.6179360Z 2025-05-07T20:25:35.7084142Z libcusolver-11.7.1.2 | 95.8 MB | # | 11%  2025-05-07T20:25:35.7085460Z 2025-05-07T20:25:35.7126326Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 94%  2025-05-07T20:25:35.7163848Z nsight-compute-2024. | 443.1 MB | #####4 | 55% 2025-05-07T20:25:35.7164091Z 2025-05-07T20:25:35.7164095Z 2025-05-07T20:25:35.7164099Z 2025-05-07T20:25:35.7164103Z 2025-05-07T20:25:35.7168276Z 2025-05-07T20:25:35.7182192Z cuda-nvvp-12.6.80 | 109.3 MB | ## | 20%  2025-05-07T20:25:35.7182585Z 2025-05-07T20:25:35.7182590Z 2025-05-07T20:25:35.7182596Z 2025-05-07T20:25:35.7182601Z 2025-05-07T20:25:35.7182606Z 2025-05-07T20:25:35.7185316Z 2025-05-07T20:25:35.8167111Z libcusolver-11.7.1.2 | 95.8 MB | #3 | 14%  2025-05-07T20:25:35.8167412Z 2025-05-07T20:25:35.8167416Z 2025-05-07T20:25:35.8167420Z 2025-05-07T20:25:35.8167661Z 2025-05-07T20:25:35.8170838Z 2025-05-07T20:25:35.8185749Z cuda-nvvp-12.6.80 | 109.3 MB | ##2 | 23%  2025-05-07T20:25:35.8186249Z 2025-05-07T20:25:35.8186255Z 2025-05-07T20:25:35.8186259Z 2025-05-07T20:25:35.8186262Z 2025-05-07T20:25:35.8186266Z 2025-05-07T20:25:35.8189030Z 2025-05-07T20:25:35.8214507Z libcusolver-11.7.1.2 | 95.8 MB | #6 | 17%  2025-05-07T20:25:35.8272338Z nsight-compute-2024. | 443.1 MB | #####5 | 56% 2025-05-07T20:25:35.8274886Z 2025-05-07T20:25:35.9188499Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 95%  2025-05-07T20:25:35.9188850Z 2025-05-07T20:25:35.9188854Z 2025-05-07T20:25:35.9188858Z 2025-05-07T20:25:35.9188862Z 2025-05-07T20:25:35.9188866Z 2025-05-07T20:25:35.9190358Z 2025-05-07T20:25:35.9201254Z libcusolver-11.7.1.2 | 95.8 MB | #9 | 20%  2025-05-07T20:25:35.9201665Z 2025-05-07T20:25:35.9201671Z 2025-05-07T20:25:35.9201676Z 2025-05-07T20:25:35.9201709Z 2025-05-07T20:25:35.9203306Z 2025-05-07T20:25:35.9303608Z cuda-nvvp-12.6.80 | 109.3 MB | ##5 | 25%  2025-05-07T20:25:35.9303919Z 2025-05-07T20:25:35.9311993Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 96%  2025-05-07T20:25:36.0194504Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:25:36.0194801Z 2025-05-07T20:25:36.0194807Z 2025-05-07T20:25:36.0194812Z 2025-05-07T20:25:36.0194817Z 2025-05-07T20:25:36.0194822Z 2025-05-07T20:25:36.0194827Z 2025-05-07T20:25:36.0366733Z libcusolver-11.7.1.2 | 95.8 MB | ##3 | 23%  2025-05-07T20:25:36.0408305Z nsight-compute-2024. | 443.1 MB | #####6 | 57% 2025-05-07T20:25:36.0408678Z 2025-05-07T20:25:36.0408684Z 2025-05-07T20:25:36.0408689Z 2025-05-07T20:25:36.0408694Z 2025-05-07T20:25:36.0408699Z 2025-05-07T20:25:36.0525162Z cuda-nvvp-12.6.80 | 109.3 MB | ##7 | 28%  2025-05-07T20:25:36.0530513Z 2025-05-07T20:25:36.1197298Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 97%  2025-05-07T20:25:36.1197716Z 2025-05-07T20:25:36.1197722Z 2025-05-07T20:25:36.1197727Z 2025-05-07T20:25:36.1197733Z 2025-05-07T20:25:36.1197750Z 2025-05-07T20:25:36.1197756Z 2025-05-07T20:25:36.1368627Z libcusolver-11.7.1.2 | 95.8 MB | ##6 | 26%  2025-05-07T20:25:36.1411862Z nsight-compute-2024. 
| 443.1 MB | #####7 | 57% 2025-05-07T20:25:36.1412223Z 2025-05-07T20:25:36.1412230Z 2025-05-07T20:25:36.1412235Z 2025-05-07T20:25:36.1412250Z 2025-05-07T20:25:36.1412255Z 2025-05-07T20:25:36.1620706Z cuda-nvvp-12.6.80 | 109.3 MB | ### | 30%  2025-05-07T20:25:36.1621088Z 2025-05-07T20:25:36.2226562Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 99%  2025-05-07T20:25:36.2226924Z 2025-05-07T20:25:36.2226930Z 2025-05-07T20:25:36.2226935Z 2025-05-07T20:25:36.2226940Z 2025-05-07T20:25:36.2226945Z 2025-05-07T20:25:36.2230977Z 2025-05-07T20:25:36.2415261Z libcusolver-11.7.1.2 | 95.8 MB | ##9 | 29%  2025-05-07T20:25:36.2415701Z 2025-05-07T20:25:36.2415707Z 2025-05-07T20:25:36.2415713Z 2025-05-07T20:25:36.2415718Z 2025-05-07T20:25:36.2420048Z 2025-05-07T20:25:36.2449986Z cuda-nvvp-12.6.80 | 109.3 MB | ###2 | 33%  2025-05-07T20:25:36.2633998Z nsight-compute-2024. | 443.1 MB | #####8 | 58% 2025-05-07T20:25:36.2634350Z 2025-05-07T20:25:36.3229767Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 100%  2025-05-07T20:25:36.3230146Z 2025-05-07T20:25:36.3230152Z 2025-05-07T20:25:36.3230158Z 2025-05-07T20:25:36.3230163Z 2025-05-07T20:25:36.3230168Z 2025-05-07T20:25:36.3230173Z 2025-05-07T20:25:36.3478473Z libcusolver-11.7.1.2 | 95.8 MB | ###2 | 33%  2025-05-07T20:25:36.3526277Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:25:36.3526712Z 2025-05-07T20:25:36.3526718Z 2025-05-07T20:25:36.3526723Z 2025-05-07T20:25:36.3526728Z 2025-05-07T20:25:36.3529967Z 2025-05-07T20:25:36.4329677Z cuda-nvvp-12.6.80 | 109.3 MB | ###5 | 35%  2025-05-07T20:25:36.4330365Z 2025-05-07T20:25:36.4330372Z 2025-05-07T20:25:36.4330377Z 2025-05-07T20:25:36.4330533Z 2025-05-07T20:25:36.4330539Z 2025-05-07T20:25:36.4330544Z 2025-05-07T20:25:36.4488296Z libcusolver-11.7.1.2 | 95.8 MB | ###5 | 36%  2025-05-07T20:25:36.4570302Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:25:36.4570659Z 2025-05-07T20:25:36.4570666Z 2025-05-07T20:25:36.4570671Z 2025-05-07T20:25:36.4570676Z 2025-05-07T20:25:36.4575956Z 2025-05-07T20:25:36.5331418Z cuda-nvvp-12.6.80 | 109.3 MB | ###7 | 38%  2025-05-07T20:25:36.5331703Z 2025-05-07T20:25:36.5331707Z 2025-05-07T20:25:36.5331711Z 2025-05-07T20:25:36.5331715Z 2025-05-07T20:25:36.5331730Z 2025-05-07T20:25:36.5332891Z 2025-05-07T20:25:36.5493021Z libcusolver-11.7.1.2 | 95.8 MB | ###9 | 39%  2025-05-07T20:25:36.5643382Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:25:36.5643656Z 2025-05-07T20:25:36.5643661Z 2025-05-07T20:25:36.5643665Z 2025-05-07T20:25:36.5643668Z 2025-05-07T20:25:36.5647210Z 2025-05-07T20:25:36.6378210Z cuda-nvvp-12.6.80 | 109.3 MB | #### | 40%  2025-05-07T20:25:36.6386202Z 2025-05-07T20:25:36.6386207Z 2025-05-07T20:25:36.6386221Z 2025-05-07T20:25:36.6386225Z 2025-05-07T20:25:36.6386229Z 2025-05-07T20:25:36.6386232Z 2025-05-07T20:25:36.6495149Z libcusolver-11.7.1.2 | 95.8 MB | ####2 | 42%  2025-05-07T20:25:36.6645003Z nsight-compute-2024. | 443.1 MB | ###### | 61% 2025-05-07T20:25:36.6645347Z 2025-05-07T20:25:36.6645353Z 2025-05-07T20:25:36.6645358Z 2025-05-07T20:25:36.6645363Z 2025-05-07T20:25:36.6646772Z 2025-05-07T20:25:36.7380327Z cuda-nvvp-12.6.80 | 109.3 MB | ####2 | 43%  2025-05-07T20:25:36.7380718Z 2025-05-07T20:25:36.7380723Z 2025-05-07T20:25:36.7380729Z 2025-05-07T20:25:36.7380734Z 2025-05-07T20:25:36.7380739Z 2025-05-07T20:25:36.7380834Z 2025-05-07T20:25:36.7499665Z libcusolver-11.7.1.2 | 95.8 MB | ####5 | 46%  2025-05-07T20:25:36.7650789Z nsight-compute-2024. 
| 443.1 MB | ######1 | 61% 2025-05-07T20:25:36.7651152Z 2025-05-07T20:25:36.7651158Z 2025-05-07T20:25:36.7651164Z 2025-05-07T20:25:36.7651169Z 2025-05-07T20:25:36.7652923Z 2025-05-07T20:25:36.8380780Z cuda-nvvp-12.6.80 | 109.3 MB | ####5 | 46%  2025-05-07T20:25:36.8381142Z 2025-05-07T20:25:36.8381147Z 2025-05-07T20:25:36.8381151Z 2025-05-07T20:25:36.8381155Z 2025-05-07T20:25:36.8381158Z 2025-05-07T20:25:36.8385879Z 2025-05-07T20:25:36.8504838Z libcusolver-11.7.1.2 | 95.8 MB | ####9 | 49%  2025-05-07T20:25:36.8655770Z nsight-compute-2024. | 443.1 MB | ######2 | 62% 2025-05-07T20:25:36.8656164Z 2025-05-07T20:25:36.8656172Z 2025-05-07T20:25:36.8656179Z 2025-05-07T20:25:36.8656185Z 2025-05-07T20:25:36.8662418Z 2025-05-07T20:25:36.9390652Z cuda-nvvp-12.6.80 | 109.3 MB | ####8 | 49%  2025-05-07T20:25:36.9391016Z 2025-05-07T20:25:36.9391021Z 2025-05-07T20:25:36.9391025Z 2025-05-07T20:25:36.9391040Z 2025-05-07T20:25:36.9391044Z 2025-05-07T20:25:36.9391047Z 2025-05-07T20:25:36.9515456Z libcusolver-11.7.1.2 | 95.8 MB | #####2 | 52%  2025-05-07T20:25:36.9656797Z nsight-compute-2024. | 443.1 MB | ######2 | 63% 2025-05-07T20:25:36.9657094Z 2025-05-07T20:25:36.9657167Z 2025-05-07T20:25:36.9657173Z 2025-05-07T20:25:36.9657178Z 2025-05-07T20:25:36.9658487Z 2025-05-07T20:25:37.0399823Z cuda-nvvp-12.6.80 | 109.3 MB | #####1 | 51%  2025-05-07T20:25:37.0400131Z 2025-05-07T20:25:37.0400135Z 2025-05-07T20:25:37.0400139Z 2025-05-07T20:25:37.0400143Z 2025-05-07T20:25:37.0400147Z 2025-05-07T20:25:37.0400151Z 2025-05-07T20:25:37.0517238Z libcusolver-11.7.1.2 | 95.8 MB | #####5 | 56%  2025-05-07T20:25:37.0674856Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:25:37.0675446Z 2025-05-07T20:25:37.0675451Z 2025-05-07T20:25:37.0675455Z 2025-05-07T20:25:37.0675459Z 2025-05-07T20:25:37.0678289Z 2025-05-07T20:25:37.1413293Z cuda-nvvp-12.6.80 | 109.3 MB | #####4 | 54%  2025-05-07T20:25:37.1413683Z 2025-05-07T20:25:37.1413687Z 2025-05-07T20:25:37.1413691Z 2025-05-07T20:25:37.1413708Z 2025-05-07T20:25:37.1413712Z 2025-05-07T20:25:37.1416294Z 2025-05-07T20:25:37.1518862Z libcusolver-11.7.1.2 | 95.8 MB | #####9 | 59%  2025-05-07T20:25:37.1797544Z nsight-compute-2024. | 443.1 MB | ######4 | 64% 2025-05-07T20:25:37.1797791Z 2025-05-07T20:25:37.1797886Z 2025-05-07T20:25:37.1797893Z 2025-05-07T20:25:37.1797897Z 2025-05-07T20:25:37.1802364Z 2025-05-07T20:25:37.2419908Z cuda-nvvp-12.6.80 | 109.3 MB | #####6 | 57%  2025-05-07T20:25:37.2420212Z 2025-05-07T20:25:37.2420215Z 2025-05-07T20:25:37.2420220Z 2025-05-07T20:25:37.2420223Z 2025-05-07T20:25:37.2420227Z 2025-05-07T20:25:37.2420249Z 2025-05-07T20:25:37.2521791Z libcusolver-11.7.1.2 | 95.8 MB | ######2 | 63%  2025-05-07T20:25:37.2847563Z nsight-compute-2024. | 443.1 MB | ######4 | 65% 2025-05-07T20:25:37.2847828Z 2025-05-07T20:25:37.2847832Z 2025-05-07T20:25:37.2847836Z 2025-05-07T20:25:37.2847839Z 2025-05-07T20:25:37.2851323Z 2025-05-07T20:25:37.3577002Z cuda-nvvp-12.6.80 | 109.3 MB | #####9 | 59%  2025-05-07T20:25:37.3583969Z nsight-compute-2024. 
| 443.1 MB | ######5 | 66% 2025-05-07T20:25:37.3584223Z 2025-05-07T20:25:37.3584228Z 2025-05-07T20:25:37.3584242Z 2025-05-07T20:25:37.3584245Z 2025-05-07T20:25:37.3584249Z 2025-05-07T20:25:37.3584352Z 2025-05-07T20:25:37.3924039Z libcusolver-11.7.1.2 | 95.8 MB | ######5 | 66%  2025-05-07T20:25:37.3924343Z 2025-05-07T20:25:37.3924347Z 2025-05-07T20:25:37.3924351Z 2025-05-07T20:25:37.3924355Z 2025-05-07T20:25:37.3924358Z 2025-05-07T20:25:37.4583875Z cuda-nvvp-12.6.80 | 109.3 MB | ######1 | 62%  2025-05-07T20:25:37.4690896Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:25:37.4691150Z 2025-05-07T20:25:37.4691167Z 2025-05-07T20:25:37.4691172Z 2025-05-07T20:25:37.4691176Z 2025-05-07T20:25:37.4691180Z 2025-05-07T20:25:37.4693441Z 2025-05-07T20:25:37.4926915Z libcusolver-11.7.1.2 | 95.8 MB | ######9 | 69%  2025-05-07T20:25:37.4927251Z 2025-05-07T20:25:37.4927254Z 2025-05-07T20:25:37.4927258Z 2025-05-07T20:25:37.4927262Z 2025-05-07T20:25:37.4932335Z 2025-05-07T20:25:37.5687003Z cuda-nvvp-12.6.80 | 109.3 MB | ######4 | 64%  2025-05-07T20:25:37.5870924Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:25:37.5871264Z 2025-05-07T20:25:37.5871270Z 2025-05-07T20:25:37.5871275Z 2025-05-07T20:25:37.5871280Z 2025-05-07T20:25:37.5871285Z 2025-05-07T20:25:37.5871290Z 2025-05-07T20:25:37.5936328Z libcusolver-11.7.1.2 | 95.8 MB | #######2 | 72%  2025-05-07T20:25:37.5936638Z 2025-05-07T20:25:37.5936642Z 2025-05-07T20:25:37.5936646Z 2025-05-07T20:25:37.5936649Z 2025-05-07T20:25:37.5943852Z 2025-05-07T20:25:37.6732709Z cuda-nvvp-12.6.80 | 109.3 MB | ######6 | 67%  2025-05-07T20:25:37.6874316Z nsight-compute-2024. | 443.1 MB | ######7 | 68% 2025-05-07T20:25:37.6874599Z 2025-05-07T20:25:37.6874605Z 2025-05-07T20:25:37.6874610Z 2025-05-07T20:25:37.6874615Z 2025-05-07T20:25:37.6874620Z 2025-05-07T20:25:37.6879777Z 2025-05-07T20:25:37.6940825Z libcusolver-11.7.1.2 | 95.8 MB | #######5 | 75%  2025-05-07T20:25:37.6941116Z 2025-05-07T20:25:37.6941120Z 2025-05-07T20:25:37.6941124Z 2025-05-07T20:25:37.6941128Z 2025-05-07T20:25:37.6941132Z 2025-05-07T20:25:37.7735373Z cuda-nvvp-12.6.80 | 109.3 MB | ######9 | 69%  2025-05-07T20:25:37.7880550Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:25:37.7880811Z 2025-05-07T20:25:37.7880815Z 2025-05-07T20:25:37.7880818Z 2025-05-07T20:25:37.7881097Z 2025-05-07T20:25:37.7881101Z 2025-05-07T20:25:37.7881546Z 2025-05-07T20:25:37.7944470Z libcusolver-11.7.1.2 | 95.8 MB | #######8 | 79%  2025-05-07T20:25:37.7944784Z 2025-05-07T20:25:37.7944790Z 2025-05-07T20:25:37.7944795Z 2025-05-07T20:25:37.7944799Z 2025-05-07T20:25:37.7946850Z 2025-05-07T20:25:37.8326490Z cuda-nvvp-12.6.80 | 109.3 MB | #######2 | 72%  2025-05-07T20:25:37.8326871Z 2025-05-07T20:25:37.8329957Z 2025-05-07T20:25:37.8840417Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:37.8866268Z nsight-compute-2024. 
| 443.1 MB | ######9 | 69% 2025-05-07T20:25:37.8866632Z 2025-05-07T20:25:37.8866638Z 2025-05-07T20:25:37.8866643Z 2025-05-07T20:25:37.8866648Z 2025-05-07T20:25:37.8866653Z 2025-05-07T20:25:37.8866658Z 2025-05-07T20:25:37.8870242Z 2025-05-07T20:25:37.8952716Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:37.8953111Z 2025-05-07T20:25:37.8953134Z 2025-05-07T20:25:37.8953140Z 2025-05-07T20:25:37.8953145Z 2025-05-07T20:25:37.8953151Z 2025-05-07T20:25:37.8953156Z 2025-05-07T20:25:37.9045942Z libcusolver-11.7.1.2 | 95.8 MB | ########1 | 82%  2025-05-07T20:25:37.9046403Z 2025-05-07T20:25:37.9046409Z 2025-05-07T20:25:37.9046414Z 2025-05-07T20:25:37.9046419Z 2025-05-07T20:25:37.9048781Z 2025-05-07T20:25:37.9868448Z cuda-nvvp-12.6.80 | 109.3 MB | #######4 | 75%  2025-05-07T20:25:37.9868840Z 2025-05-07T20:25:37.9868846Z 2025-05-07T20:25:37.9868851Z 2025-05-07T20:25:37.9868856Z 2025-05-07T20:25:37.9868861Z 2025-05-07T20:25:37.9868866Z 2025-05-07T20:25:37.9872799Z 2025-05-07T20:25:38.0075174Z libnpp-12.3.1.54 | 93.4 MB | 3 | 3%  2025-05-07T20:25:38.0091454Z nsight-compute-2024. | 443.1 MB | ######9 | 70% 2025-05-07T20:25:38.0091814Z 2025-05-07T20:25:38.0091821Z 2025-05-07T20:25:38.0091826Z 2025-05-07T20:25:38.0091832Z 2025-05-07T20:25:38.0091850Z 2025-05-07T20:25:38.0145835Z cuda-nvvp-12.6.80 | 109.3 MB | #######7 | 77%  2025-05-07T20:25:38.0146213Z 2025-05-07T20:25:38.0146232Z 2025-05-07T20:25:38.0146237Z 2025-05-07T20:25:38.0146242Z 2025-05-07T20:25:38.0146247Z 2025-05-07T20:25:38.0147812Z 2025-05-07T20:25:38.0869196Z libcusolver-11.7.1.2 | 95.8 MB | ########4 | 85%  2025-05-07T20:25:38.0869598Z 2025-05-07T20:25:38.0869603Z 2025-05-07T20:25:38.0869608Z 2025-05-07T20:25:38.0869613Z 2025-05-07T20:25:38.0869618Z 2025-05-07T20:25:38.0869623Z 2025-05-07T20:25:38.0871280Z 2025-05-07T20:25:38.1175727Z libnpp-12.3.1.54 | 93.4 MB | 5 | 6%  2025-05-07T20:25:38.1300880Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:38.1301236Z 2025-05-07T20:25:38.1301242Z 2025-05-07T20:25:38.1301247Z 2025-05-07T20:25:38.1301252Z 2025-05-07T20:25:38.1301257Z 2025-05-07T20:25:38.1452915Z cuda-nvvp-12.6.80 | 109.3 MB | #######9 | 80%  2025-05-07T20:25:38.1453310Z 2025-05-07T20:25:38.1453316Z 2025-05-07T20:25:38.1453321Z 2025-05-07T20:25:38.1453326Z 2025-05-07T20:25:38.1453331Z 2025-05-07T20:25:38.1453346Z 2025-05-07T20:25:38.1873261Z libcusolver-11.7.1.2 | 95.8 MB | ########7 | 88%  2025-05-07T20:25:38.1873668Z 2025-05-07T20:25:38.1873674Z 2025-05-07T20:25:38.1873679Z 2025-05-07T20:25:38.1873684Z 2025-05-07T20:25:38.1873689Z 2025-05-07T20:25:38.1873694Z 2025-05-07T20:25:38.1876882Z 2025-05-07T20:25:38.2226750Z libnpp-12.3.1.54 | 93.4 MB | 8 | 8%  2025-05-07T20:25:38.2304666Z nsight-compute-2024. | 443.1 MB | ####### | 71% 2025-05-07T20:25:38.2305020Z 2025-05-07T20:25:38.2305025Z 2025-05-07T20:25:38.2305031Z 2025-05-07T20:25:38.2305036Z 2025-05-07T20:25:38.2305041Z 2025-05-07T20:25:38.2635540Z cuda-nvvp-12.6.80 | 109.3 MB | ########2 | 82%  2025-05-07T20:25:38.2635929Z 2025-05-07T20:25:38.2635934Z 2025-05-07T20:25:38.2636176Z 2025-05-07T20:25:38.2636181Z 2025-05-07T20:25:38.2636187Z 2025-05-07T20:25:38.2636192Z 2025-05-07T20:25:38.2882455Z libcusolver-11.7.1.2 | 95.8 MB | ######### | 90%  2025-05-07T20:25:38.2882878Z 2025-05-07T20:25:38.2882883Z 2025-05-07T20:25:38.2882888Z 2025-05-07T20:25:38.2882894Z 2025-05-07T20:25:38.2882899Z 2025-05-07T20:25:38.2882904Z 2025-05-07T20:25:38.2887258Z 2025-05-07T20:25:38.3290559Z libnpp-12.3.1.54 | 93.4 MB | #1 | 11%  2025-05-07T20:25:38.3309314Z nsight-compute-2024. 
| 443.1 MB | #######1 | 71% 2025-05-07T20:25:38.3309671Z 2025-05-07T20:25:38.3309677Z 2025-05-07T20:25:38.3309682Z 2025-05-07T20:25:38.3309687Z 2025-05-07T20:25:38.3313715Z 2025-05-07T20:25:38.3716115Z cuda-nvvp-12.6.80 | 109.3 MB | ########4 | 85%  2025-05-07T20:25:38.3716505Z 2025-05-07T20:25:38.3716510Z 2025-05-07T20:25:38.3716515Z 2025-05-07T20:25:38.3716520Z 2025-05-07T20:25:38.3716525Z 2025-05-07T20:25:38.3716530Z 2025-05-07T20:25:38.3890351Z libcusolver-11.7.1.2 | 95.8 MB | #########2 | 93%  2025-05-07T20:25:38.3890727Z 2025-05-07T20:25:38.3890731Z 2025-05-07T20:25:38.3890747Z 2025-05-07T20:25:38.3890751Z 2025-05-07T20:25:38.3890755Z 2025-05-07T20:25:38.3890758Z 2025-05-07T20:25:38.3890851Z 2025-05-07T20:25:38.4290690Z libnpp-12.3.1.54 | 93.4 MB | #4 | 14%  2025-05-07T20:25:38.4372678Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:38.4373035Z 2025-05-07T20:25:38.4373041Z 2025-05-07T20:25:38.4373046Z 2025-05-07T20:25:38.4373051Z 2025-05-07T20:25:38.4377291Z 2025-05-07T20:25:38.4833277Z cuda-nvvp-12.6.80 | 109.3 MB | ########6 | 87%  2025-05-07T20:25:38.4833657Z 2025-05-07T20:25:38.4833661Z 2025-05-07T20:25:38.4833664Z 2025-05-07T20:25:38.4833668Z 2025-05-07T20:25:38.4833672Z 2025-05-07T20:25:38.4833676Z 2025-05-07T20:25:38.4899363Z libcusolver-11.7.1.2 | 95.8 MB | #########5 | 95%  2025-05-07T20:25:38.4899717Z 2025-05-07T20:25:38.4899722Z 2025-05-07T20:25:38.4899725Z 2025-05-07T20:25:38.4899729Z 2025-05-07T20:25:38.4899733Z 2025-05-07T20:25:38.4899745Z 2025-05-07T20:25:38.4899749Z 2025-05-07T20:25:38.5299980Z libnpp-12.3.1.54 | 93.4 MB | #6 | 17%  2025-05-07T20:25:38.5449731Z nsight-compute-2024. | 443.1 MB | #######2 | 73% 2025-05-07T20:25:38.5450082Z 2025-05-07T20:25:38.5450087Z 2025-05-07T20:25:38.5450091Z 2025-05-07T20:25:38.5450095Z 2025-05-07T20:25:38.5452035Z 2025-05-07T20:25:38.5869639Z cuda-nvvp-12.6.80 | 109.3 MB | ########9 | 89%  2025-05-07T20:25:38.5870011Z 2025-05-07T20:25:38.5870015Z 2025-05-07T20:25:38.5870019Z 2025-05-07T20:25:38.5870023Z 2025-05-07T20:25:38.5870027Z 2025-05-07T20:25:38.5870031Z 2025-05-07T20:25:38.5939973Z libcusolver-11.7.1.2 | 95.8 MB | #########7 | 98%  2025-05-07T20:25:38.5940290Z 2025-05-07T20:25:38.5940294Z 2025-05-07T20:25:38.5940298Z 2025-05-07T20:25:38.5940312Z 2025-05-07T20:25:38.5940320Z 2025-05-07T20:25:38.5940325Z 2025-05-07T20:25:38.5940330Z 2025-05-07T20:25:38.6341635Z libnpp-12.3.1.54 | 93.4 MB | #9 | 20%  2025-05-07T20:25:38.6413203Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:38.6413565Z 2025-05-07T20:25:38.6413572Z 2025-05-07T20:25:38.6413577Z 2025-05-07T20:25:38.6413583Z 2025-05-07T20:25:38.6472882Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:38.6473155Z 2025-05-07T20:25:38.6473159Z 2025-05-07T20:25:38.6473169Z 2025-05-07T20:25:38.6473173Z 2025-05-07T20:25:38.6476492Z 2025-05-07T20:25:38.6940685Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 92%  2025-05-07T20:25:38.6940954Z 2025-05-07T20:25:38.6940966Z 2025-05-07T20:25:38.6940970Z 2025-05-07T20:25:38.6940973Z 2025-05-07T20:25:38.6940977Z 2025-05-07T20:25:38.6940981Z 2025-05-07T20:25:38.6946390Z 2025-05-07T20:25:38.7341896Z libnpp-12.3.1.54 | 93.4 MB | ##2 | 23%  2025-05-07T20:25:38.7473370Z nsight-compute-2024. 
| 443.1 MB | #######3 | 74% 2025-05-07T20:25:38.7473614Z 2025-05-07T20:25:38.7473784Z 2025-05-07T20:25:38.7473789Z 2025-05-07T20:25:38.7473793Z 2025-05-07T20:25:38.7475249Z 2025-05-07T20:25:38.7949925Z cuda-nvvp-12.6.80 | 109.3 MB | #########4 | 95%  2025-05-07T20:25:38.7950194Z 2025-05-07T20:25:38.7950198Z 2025-05-07T20:25:38.7950201Z 2025-05-07T20:25:38.7950213Z 2025-05-07T20:25:38.7950217Z 2025-05-07T20:25:38.7950221Z 2025-05-07T20:25:38.7951996Z 2025-05-07T20:25:38.8346270Z libnpp-12.3.1.54 | 93.4 MB | ##5 | 26%  2025-05-07T20:25:38.8474901Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:25:38.8475147Z 2025-05-07T20:25:38.8475151Z 2025-05-07T20:25:38.8475155Z 2025-05-07T20:25:38.8475165Z 2025-05-07T20:25:38.8476681Z 2025-05-07T20:25:38.8950298Z cuda-nvvp-12.6.80 | 109.3 MB | #########8 | 98%  2025-05-07T20:25:38.8950588Z 2025-05-07T20:25:38.8950592Z 2025-05-07T20:25:38.8950603Z 2025-05-07T20:25:38.8950607Z 2025-05-07T20:25:38.8950611Z 2025-05-07T20:25:38.8950620Z 2025-05-07T20:25:38.8955426Z 2025-05-07T20:25:38.9346983Z libnpp-12.3.1.54 | 93.4 MB | ##9 | 29%  2025-05-07T20:25:38.9952321Z nsight-compute-2024. | 443.1 MB | #######5 | 75% 2025-05-07T20:25:38.9952699Z 2025-05-07T20:25:38.9952705Z 2025-05-07T20:25:38.9952710Z 2025-05-07T20:25:38.9952716Z 2025-05-07T20:25:38.9952720Z 2025-05-07T20:25:38.9952726Z 2025-05-07T20:25:38.9952730Z 2025-05-07T20:25:39.0352117Z libnpp-12.3.1.54 | 93.4 MB | ###2 | 33%  2025-05-07T20:25:39.0956382Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:39.0956737Z 2025-05-07T20:25:39.0956742Z 2025-05-07T20:25:39.0956746Z 2025-05-07T20:25:39.0956749Z 2025-05-07T20:25:39.0956753Z 2025-05-07T20:25:39.0956757Z 2025-05-07T20:25:39.0958781Z 2025-05-07T20:25:39.1353323Z libnpp-12.3.1.54 | 93.4 MB | ###6 | 36%  2025-05-07T20:25:39.1962515Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:25:39.1962789Z 2025-05-07T20:25:39.1962793Z 2025-05-07T20:25:39.1962797Z 2025-05-07T20:25:39.1962801Z 2025-05-07T20:25:39.1962805Z 2025-05-07T20:25:39.1962815Z 2025-05-07T20:25:39.1965751Z 2025-05-07T20:25:39.2354731Z libnpp-12.3.1.54 | 93.4 MB | ###9 | 40%  2025-05-07T20:25:39.2968064Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:25:39.2968419Z 2025-05-07T20:25:39.2968424Z 2025-05-07T20:25:39.2968427Z 2025-05-07T20:25:39.2968431Z 2025-05-07T20:25:39.2968434Z 2025-05-07T20:25:39.2968438Z 2025-05-07T20:25:39.2968441Z 2025-05-07T20:25:39.3357704Z libnpp-12.3.1.54 | 93.4 MB | ####3 | 43%  2025-05-07T20:25:39.3968263Z nsight-compute-2024. | 443.1 MB | #######8 | 79% 2025-05-07T20:25:39.3968619Z 2025-05-07T20:25:39.3968626Z 2025-05-07T20:25:39.3968646Z 2025-05-07T20:25:39.3968652Z 2025-05-07T20:25:39.3968796Z 2025-05-07T20:25:39.3968801Z 2025-05-07T20:25:39.3970232Z 2025-05-07T20:25:39.4358778Z libnpp-12.3.1.54 | 93.4 MB | ####6 | 47%  2025-05-07T20:25:39.4974420Z nsight-compute-2024. | 443.1 MB | #######9 | 80% 2025-05-07T20:25:39.4974804Z 2025-05-07T20:25:39.4974810Z 2025-05-07T20:25:39.4974816Z 2025-05-07T20:25:39.4974824Z 2025-05-07T20:25:39.4974966Z 2025-05-07T20:25:39.4974973Z 2025-05-07T20:25:39.4976584Z 2025-05-07T20:25:39.5359019Z libnpp-12.3.1.54 | 93.4 MB | ##### | 51%  2025-05-07T20:25:39.5997286Z nsight-compute-2024. 
| 443.1 MB | ######## | 80% 2025-05-07T20:25:39.5997648Z 2025-05-07T20:25:39.5997652Z 2025-05-07T20:25:39.5997656Z 2025-05-07T20:25:39.5997660Z 2025-05-07T20:25:39.5997663Z 2025-05-07T20:25:39.5997667Z 2025-05-07T20:25:39.6000003Z 2025-05-07T20:25:39.6380077Z libnpp-12.3.1.54 | 93.4 MB | #####4 | 55%  2025-05-07T20:25:39.7046325Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:25:39.7046609Z 2025-05-07T20:25:39.7046824Z 2025-05-07T20:25:39.7046832Z 2025-05-07T20:25:39.7046837Z 2025-05-07T20:25:39.7046842Z 2025-05-07T20:25:39.7046847Z 2025-05-07T20:25:39.7049248Z 2025-05-07T20:25:39.7382894Z libnpp-12.3.1.54 | 93.4 MB | #####8 | 58%  2025-05-07T20:25:39.8046446Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:25:39.8046789Z 2025-05-07T20:25:39.8046793Z 2025-05-07T20:25:39.8046796Z 2025-05-07T20:25:39.8046800Z 2025-05-07T20:25:39.8046804Z 2025-05-07T20:25:39.8046808Z 2025-05-07T20:25:39.8048434Z 2025-05-07T20:25:39.8435065Z libnpp-12.3.1.54 | 93.4 MB | ######2 | 62%  2025-05-07T20:25:39.9048099Z nsight-compute-2024. | 443.1 MB | ########2 | 83% 2025-05-07T20:25:39.9048410Z 2025-05-07T20:25:39.9048415Z 2025-05-07T20:25:39.9048419Z 2025-05-07T20:25:39.9048422Z 2025-05-07T20:25:39.9048444Z 2025-05-07T20:25:39.9048448Z 2025-05-07T20:25:39.9049695Z 2025-05-07T20:25:40.0054381Z libnpp-12.3.1.54 | 93.4 MB | ######6 | 66%  2025-05-07T20:25:40.0054671Z 2025-05-07T20:25:40.0054677Z 2025-05-07T20:25:40.0054681Z 2025-05-07T20:25:40.0054685Z 2025-05-07T20:25:40.0054689Z 2025-05-07T20:25:40.0054701Z 2025-05-07T20:25:40.0054708Z 2025-05-07T20:25:40.0985853Z libnpp-12.3.1.54 | 93.4 MB | ####### | 71%  2025-05-07T20:25:40.1985694Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:25:40.2746789Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:25:40.2747059Z 2025-05-07T20:25:40.2747130Z 2025-05-07T20:25:40.2747137Z 2025-05-07T20:25:40.2747143Z 2025-05-07T20:25:40.2747148Z 2025-05-07T20:25:40.2747153Z 2025-05-07T20:25:40.2749671Z 2025-05-07T20:25:40.2987429Z libnpp-12.3.1.54 | 93.4 MB | #######4 | 75%  2025-05-07T20:25:40.3747425Z nsight-compute-2024. | 443.1 MB | ########5 | 85% 2025-05-07T20:25:40.3747698Z 2025-05-07T20:25:40.3747779Z 2025-05-07T20:25:40.3747783Z 2025-05-07T20:25:40.3747904Z 2025-05-07T20:25:40.3747930Z 2025-05-07T20:25:40.3747936Z 2025-05-07T20:25:40.3750635Z 2025-05-07T20:25:40.3988777Z libnpp-12.3.1.54 | 93.4 MB | #######7 | 78%  2025-05-07T20:25:40.4747610Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:25:40.4747875Z 2025-05-07T20:25:40.4747879Z 2025-05-07T20:25:40.4747883Z 2025-05-07T20:25:40.4747887Z 2025-05-07T20:25:40.4747891Z 2025-05-07T20:25:40.4747894Z 2025-05-07T20:25:40.4748010Z 2025-05-07T20:25:40.5001783Z libnpp-12.3.1.54 | 93.4 MB | ########2 | 82%  2025-05-07T20:25:40.5749520Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:25:40.5749779Z 2025-05-07T20:25:40.5749895Z 2025-05-07T20:25:40.5749900Z 2025-05-07T20:25:40.5750024Z 2025-05-07T20:25:40.5750031Z 2025-05-07T20:25:40.5750037Z 2025-05-07T20:25:40.5750057Z 2025-05-07T20:25:40.6006636Z libnpp-12.3.1.54 | 93.4 MB | ########5 | 86%  2025-05-07T20:25:40.6800364Z nsight-compute-2024. | 443.1 MB | ########7 | 88% 2025-05-07T20:25:40.6800635Z 2025-05-07T20:25:40.6800639Z 2025-05-07T20:25:40.6800643Z 2025-05-07T20:25:40.6800647Z 2025-05-07T20:25:40.6800651Z 2025-05-07T20:25:40.6800654Z 2025-05-07T20:25:40.6800658Z 2025-05-07T20:25:40.7011610Z libnpp-12.3.1.54 | 93.4 MB | ########9 | 90%  2025-05-07T20:25:40.7802440Z nsight-compute-2024. 
| 443.1 MB | ########8 | 89% 2025-05-07T20:25:40.7802708Z 2025-05-07T20:25:40.7802712Z 2025-05-07T20:25:40.7802717Z 2025-05-07T20:25:40.7802720Z 2025-05-07T20:25:40.7802724Z 2025-05-07T20:25:40.7802728Z 2025-05-07T20:25:40.7802866Z 2025-05-07T20:25:40.8018666Z libnpp-12.3.1.54 | 93.4 MB | #########3 | 93%  2025-05-07T20:25:40.8807613Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:25:40.8808234Z 2025-05-07T20:25:40.8808240Z 2025-05-07T20:25:40.8808246Z 2025-05-07T20:25:40.8808251Z 2025-05-07T20:25:40.8808257Z 2025-05-07T20:25:40.8808261Z 2025-05-07T20:25:40.8812355Z 2025-05-07T20:25:40.9585658Z libnpp-12.3.1.54 | 93.4 MB | #########7 | 97%  2025-05-07T20:25:41.0737865Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:25:41.1745232Z nsight-compute-2024. | 443.1 MB | #########1 | 91% 2025-05-07T20:25:41.2748987Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:25:41.3749462Z nsight-compute-2024. | 443.1 MB | #########3 | 93% 2025-05-07T20:25:41.4695320Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:25:41.4695577Z 2025-05-07T20:25:41.4695906Z 2025-05-07T20:25:41.4695910Z 2025-05-07T20:25:41.4695922Z 2025-05-07T20:25:41.4695959Z 2025-05-07T20:25:41.4695963Z 2025-05-07T20:25:41.4749773Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:41.5053977Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:25:41.5054297Z 2025-05-07T20:25:41.5054494Z 2025-05-07T20:25:41.5054500Z 2025-05-07T20:25:41.5054514Z 2025-05-07T20:25:41.5054519Z 2025-05-07T20:25:41.5054524Z 2025-05-07T20:25:41.5054529Z 2025-05-07T20:25:41.5055703Z 2025-05-07T20:25:41.6041821Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:41.6069508Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:25:41.6069766Z 2025-05-07T20:25:41.6069972Z 2025-05-07T20:25:41.6069977Z 2025-05-07T20:25:41.6069996Z 2025-05-07T20:25:41.6070000Z 2025-05-07T20:25:41.6070004Z 2025-05-07T20:25:41.6070008Z 2025-05-07T20:25:41.6074174Z 2025-05-07T20:25:41.7201608Z cuda-nvdisasm-12.6.7 | 47.6 MB | 7 | 7%  2025-05-07T20:25:41.7201923Z 2025-05-07T20:25:41.7201927Z 2025-05-07T20:25:41.7201931Z 2025-05-07T20:25:41.7201935Z 2025-05-07T20:25:41.7201939Z 2025-05-07T20:25:41.7201943Z 2025-05-07T20:25:41.7201955Z 2025-05-07T20:25:41.7204133Z 2025-05-07T20:25:41.7296318Z cuda-nvdisasm-12.6.7 | 47.6 MB | #4 | 14%  2025-05-07T20:25:41.8278388Z nsight-compute-2024. | 443.1 MB | #########6 | 97% 2025-05-07T20:25:41.8278704Z 2025-05-07T20:25:41.8278710Z 2025-05-07T20:25:41.8278715Z 2025-05-07T20:25:41.8278720Z 2025-05-07T20:25:41.8278725Z 2025-05-07T20:25:41.8278730Z 2025-05-07T20:25:41.8278735Z 2025-05-07T20:25:41.8282430Z 2025-05-07T20:25:41.8394983Z cuda-nvdisasm-12.6.7 | 47.6 MB | ## | 21%  2025-05-07T20:25:41.9357584Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:25:41.9357945Z 2025-05-07T20:25:41.9357951Z 2025-05-07T20:25:41.9357956Z 2025-05-07T20:25:41.9357961Z 2025-05-07T20:25:41.9357966Z 2025-05-07T20:25:41.9357971Z 2025-05-07T20:25:41.9357976Z 2025-05-07T20:25:41.9361579Z 2025-05-07T20:25:41.9520477Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##7 | 27%  2025-05-07T20:25:42.0364145Z nsight-compute-2024. 
| 443.1 MB | #########8 | 98% 2025-05-07T20:25:42.0364446Z 2025-05-07T20:25:42.0364450Z 2025-05-07T20:25:42.0364454Z 2025-05-07T20:25:42.0364466Z 2025-05-07T20:25:42.0364470Z 2025-05-07T20:25:42.0364473Z 2025-05-07T20:25:42.0364477Z 2025-05-07T20:25:42.0364681Z 2025-05-07T20:25:42.0716712Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###4 | 34%  2025-05-07T20:25:42.1366486Z nsight-compute-2024. | 443.1 MB | #########9 | 99% 2025-05-07T20:25:42.1366760Z 2025-05-07T20:25:42.1366764Z 2025-05-07T20:25:42.1366768Z 2025-05-07T20:25:42.1366771Z 2025-05-07T20:25:42.1366775Z 2025-05-07T20:25:42.1366778Z 2025-05-07T20:25:42.1366782Z 2025-05-07T20:25:42.1369252Z 2025-05-07T20:25:42.1739352Z cuda-nvdisasm-12.6.7 | 47.6 MB | #### | 41%  2025-05-07T20:25:42.2378448Z nsight-compute-2024. | 443.1 MB | #########9 | 100% 2025-05-07T20:25:42.2378713Z 2025-05-07T20:25:42.2378717Z 2025-05-07T20:25:42.2378721Z 2025-05-07T20:25:42.2378970Z 2025-05-07T20:25:42.2378976Z 2025-05-07T20:25:42.2378981Z 2025-05-07T20:25:42.2378986Z 2025-05-07T20:25:42.2383246Z 2025-05-07T20:25:42.2778015Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####7 | 47%  2025-05-07T20:25:42.2778335Z 2025-05-07T20:25:42.2778339Z 2025-05-07T20:25:42.2778343Z 2025-05-07T20:25:42.2778346Z 2025-05-07T20:25:42.2778350Z 2025-05-07T20:25:42.3383161Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:25:42.3383468Z 2025-05-07T20:25:42.3383474Z 2025-05-07T20:25:42.3383479Z 2025-05-07T20:25:42.3383484Z 2025-05-07T20:25:42.3383489Z 2025-05-07T20:25:42.3383494Z 2025-05-07T20:25:42.3383510Z 2025-05-07T20:25:42.3385784Z 2025-05-07T20:25:42.3467610Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####4 | 55%  2025-05-07T20:25:42.3467996Z 2025-05-07T20:25:42.3468000Z 2025-05-07T20:25:42.3468014Z 2025-05-07T20:25:42.3468017Z 2025-05-07T20:25:42.3468021Z 2025-05-07T20:25:42.3468036Z 2025-05-07T20:25:42.3468040Z 2025-05-07T20:25:42.3468043Z 2025-05-07T20:25:42.3469470Z 2025-05-07T20:25:42.4468279Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:25:42.4468695Z 2025-05-07T20:25:42.4468702Z 2025-05-07T20:25:42.4468706Z 2025-05-07T20:25:42.4468710Z 2025-05-07T20:25:42.4468714Z 2025-05-07T20:25:42.4468717Z 2025-05-07T20:25:42.4468721Z 2025-05-07T20:25:42.4468725Z 2025-05-07T20:25:42.4470543Z 2025-05-07T20:25:42.4496904Z libcurand-10.3.7.77 | 39.9 MB | 7 | 7%  2025-05-07T20:25:42.4497198Z 2025-05-07T20:25:42.4497202Z 2025-05-07T20:25:42.4497206Z 2025-05-07T20:25:42.4497209Z 2025-05-07T20:25:42.4497213Z 2025-05-07T20:25:42.4497219Z 2025-05-07T20:25:42.4497224Z 2025-05-07T20:25:42.4497229Z 2025-05-07T20:25:42.5482223Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######1 | 61%  2025-05-07T20:25:42.5482617Z 2025-05-07T20:25:42.5482621Z 2025-05-07T20:25:42.5482624Z 2025-05-07T20:25:42.5482641Z 2025-05-07T20:25:42.5482645Z 2025-05-07T20:25:42.5482649Z 2025-05-07T20:25:42.5482652Z 2025-05-07T20:25:42.5482656Z 2025-05-07T20:25:42.5483984Z 2025-05-07T20:25:42.5579491Z libcurand-10.3.7.77 | 39.9 MB | #4 | 14%  2025-05-07T20:25:42.5579992Z 2025-05-07T20:25:42.5579996Z 2025-05-07T20:25:42.5580000Z 2025-05-07T20:25:42.5580003Z 2025-05-07T20:25:42.5580007Z 2025-05-07T20:25:42.5580011Z 2025-05-07T20:25:42.5580023Z 2025-05-07T20:25:42.5580027Z 2025-05-07T20:25:42.6483087Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######7 | 68%  2025-05-07T20:25:42.6483390Z 2025-05-07T20:25:42.6483394Z 2025-05-07T20:25:42.6483398Z 2025-05-07T20:25:42.6483409Z 2025-05-07T20:25:42.6483413Z 2025-05-07T20:25:42.6483417Z 2025-05-07T20:25:42.6483421Z 2025-05-07T20:25:42.6483424Z 2025-05-07T20:25:42.6485223Z 2025-05-07T20:25:42.6627024Z 
2025-05-07T20:25:44.2735527Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:44.3220728Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:25:44.6882102Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:25:44.6923483Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:25:44.8979178Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:25:46.2301262Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:25:46.2962158Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:25:46.3961402Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:25:46.4081265Z python-3.10.13 | 24.5 MB | ########## | 100%
2025-05-07T20:25:46.9168497Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:25:47.1175231Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:25:47.2212325Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:25:47.3277048Z ... (more hidden) ...
2025-05-07T20:25:47.5009018Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:25:47.6094595Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:25:48.3950107Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:25:49.3697209Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:25:49.7778330Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:25:50.6209146Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:58.3214243Z 2025-05-07T20:25:58.3214385Z  2025-05-07T20:25:58.3214528Z 2025-05-07T20:25:58.3214533Z 2025-05-07T20:25:58.3214538Z 2025-05-07T20:25:58.3214678Z  2025-05-07T20:25:58.3214833Z 2025-05-07T20:25:58.3214838Z 2025-05-07T20:25:58.3214844Z 2025-05-07T20:25:58.3214849Z 2025-05-07T20:25:58.3214996Z  2025-05-07T20:25:58.3215167Z 2025-05-07T20:25:58.3215173Z 2025-05-07T20:25:58.3215187Z 2025-05-07T20:25:58.3215192Z 2025-05-07T20:25:58.3215197Z 2025-05-07T20:25:58.3215355Z  2025-05-07T20:25:58.3215531Z 2025-05-07T20:25:58.3215542Z 2025-05-07T20:25:58.3215548Z 2025-05-07T20:25:58.3215553Z 2025-05-07T20:25:58.3215559Z 2025-05-07T20:25:58.3215564Z 2025-05-07T20:25:58.3215719Z  2025-05-07T20:25:58.3215904Z 2025-05-07T20:25:58.3215909Z 2025-05-07T20:25:58.3215914Z 2025-05-07T20:25:58.3215920Z 2025-05-07T20:25:58.3215924Z 2025-05-07T20:25:58.3215929Z 2025-05-07T20:25:58.3215934Z 2025-05-07T20:25:58.3216090Z  2025-05-07T20:25:58.3216289Z 2025-05-07T20:25:58.3216295Z 2025-05-07T20:25:58.3216300Z 2025-05-07T20:25:58.3216305Z 2025-05-07T20:25:58.3216310Z 2025-05-07T20:25:58.3216315Z 2025-05-07T20:25:58.3216320Z 2025-05-07T20:25:58.3216325Z 2025-05-07T20:25:58.3216484Z  2025-05-07T20:25:58.3216699Z 2025-05-07T20:25:58.3216704Z 2025-05-07T20:25:58.3216709Z 2025-05-07T20:25:58.3216721Z 2025-05-07T20:25:58.3216741Z 2025-05-07T20:25:58.3216746Z 2025-05-07T20:25:58.3216752Z 2025-05-07T20:25:58.3216757Z 2025-05-07T20:25:58.3216762Z 2025-05-07T20:25:58.3216936Z  2025-05-07T20:25:58.3217153Z 2025-05-07T20:25:58.3217159Z 2025-05-07T20:25:58.3217164Z 2025-05-07T20:25:58.3217169Z 2025-05-07T20:25:58.3217175Z 2025-05-07T20:25:58.3217180Z 2025-05-07T20:25:58.3217185Z 2025-05-07T20:25:58.3217190Z 2025-05-07T20:25:58.3217194Z 2025-05-07T20:25:58.3217199Z 2025-05-07T20:25:58.3217379Z  2025-05-07T20:25:58.3217604Z 2025-05-07T20:25:58.3217609Z 2025-05-07T20:25:58.3217614Z 2025-05-07T20:25:58.3217619Z 2025-05-07T20:25:58.3217624Z 2025-05-07T20:25:58.3217630Z 2025-05-07T20:25:58.3217635Z 2025-05-07T20:25:58.3217640Z 2025-05-07T20:25:58.3217645Z 2025-05-07T20:25:58.3217650Z 2025-05-07T20:25:58.3217655Z 2025-05-07T20:25:58.3217838Z  2025-05-07T20:25:58.3218074Z 2025-05-07T20:25:58.3218085Z 2025-05-07T20:25:58.3218090Z 2025-05-07T20:25:58.3218095Z 2025-05-07T20:25:58.3218100Z 2025-05-07T20:25:58.3218105Z 2025-05-07T20:25:58.3218110Z 2025-05-07T20:25:58.3218120Z 2025-05-07T20:25:58.3218125Z 2025-05-07T20:25:58.3218130Z 2025-05-07T20:25:58.3218143Z 2025-05-07T20:25:58.3218148Z 2025-05-07T20:25:58.3218329Z  2025-05-07T20:25:58.3218576Z 2025-05-07T20:25:58.3218580Z 2025-05-07T20:25:58.3218586Z 2025-05-07T20:25:58.3218591Z 2025-05-07T20:25:58.3218596Z 2025-05-07T20:25:58.3218601Z 2025-05-07T20:25:58.3218614Z 2025-05-07T20:25:58.3218619Z 2025-05-07T20:25:58.3218624Z 2025-05-07T20:25:58.3218629Z 2025-05-07T20:25:58.3218634Z 2025-05-07T20:25:58.3218639Z 2025-05-07T20:25:58.3218645Z 2025-05-07T20:25:58.3218845Z  2025-05-07T20:25:58.3219110Z 2025-05-07T20:25:58.3219116Z 2025-05-07T20:25:58.3219121Z 2025-05-07T20:25:58.3219126Z 2025-05-07T20:25:58.3219131Z 2025-05-07T20:25:58.3219136Z 2025-05-07T20:25:58.3219256Z 2025-05-07T20:25:58.3219261Z 2025-05-07T20:25:58.3219266Z 2025-05-07T20:25:58.3219271Z 2025-05-07T20:25:58.3219276Z 2025-05-07T20:25:58.3219281Z 2025-05-07T20:25:58.3219379Z 2025-05-07T20:25:58.3219385Z 2025-05-07T20:25:58.3219598Z  2025-05-07T20:25:58.3219990Z 2025-05-07T20:25:58.3219996Z 2025-05-07T20:25:58.3220001Z 2025-05-07T20:25:58.3220006Z 2025-05-07T20:25:58.3220011Z 2025-05-07T20:25:58.3220016Z 
2025-05-07T20:25:58.3220022Z 2025-05-07T20:25:58.3220026Z 2025-05-07T20:25:58.3220032Z 2025-05-07T20:25:58.3220037Z 2025-05-07T20:25:58.3220050Z 2025-05-07T20:25:58.3220056Z 2025-05-07T20:25:58.3220061Z 2025-05-07T20:25:58.3220066Z 2025-05-07T20:25:58.3220071Z 2025-05-07T20:25:58.3220282Z  2025-05-07T20:25:58.3220554Z 2025-05-07T20:25:58.3220559Z 2025-05-07T20:25:58.3220564Z 2025-05-07T20:25:58.3220576Z 2025-05-07T20:25:58.3220581Z 2025-05-07T20:25:58.3220586Z 2025-05-07T20:25:58.3220600Z 2025-05-07T20:25:58.3220605Z 2025-05-07T20:25:58.3220610Z 2025-05-07T20:25:58.3220615Z 2025-05-07T20:25:58.3220620Z 2025-05-07T20:25:58.3220625Z 2025-05-07T20:25:58.3220636Z 2025-05-07T20:25:58.3220641Z 2025-05-07T20:25:58.3220647Z 2025-05-07T20:25:58.3220652Z 2025-05-07T20:25:58.3220865Z  2025-05-07T20:25:58.3221153Z 2025-05-07T20:25:58.3221158Z 2025-05-07T20:25:58.3221164Z 2025-05-07T20:25:58.3221169Z 2025-05-07T20:25:58.3221174Z 2025-05-07T20:25:58.3221179Z 2025-05-07T20:25:58.3221184Z 2025-05-07T20:25:58.3221189Z 2025-05-07T20:25:58.3221193Z 2025-05-07T20:25:58.3221198Z 2025-05-07T20:25:58.3221203Z 2025-05-07T20:25:58.3221208Z 2025-05-07T20:25:58.3221213Z 2025-05-07T20:25:58.3221218Z 2025-05-07T20:25:58.3221223Z 2025-05-07T20:25:58.3221228Z 2025-05-07T20:25:58.3221233Z 2025-05-07T20:25:58.3221459Z  2025-05-07T20:25:58.3221746Z 2025-05-07T20:25:58.3221752Z 2025-05-07T20:25:58.3221764Z 2025-05-07T20:25:58.3221769Z 2025-05-07T20:25:58.3221774Z 2025-05-07T20:25:58.3221778Z 2025-05-07T20:25:58.3221783Z 2025-05-07T20:25:58.3221789Z 2025-05-07T20:25:58.3221808Z 2025-05-07T20:25:58.3221813Z 2025-05-07T20:25:58.3221818Z 2025-05-07T20:25:58.3221823Z 2025-05-07T20:25:58.3221828Z 2025-05-07T20:25:58.3221833Z 2025-05-07T20:25:58.3221838Z 2025-05-07T20:25:58.3221843Z 2025-05-07T20:25:58.3221848Z 2025-05-07T20:25:58.3221854Z 2025-05-07T20:25:58.3222083Z  2025-05-07T20:25:58.3222387Z 2025-05-07T20:25:58.3222392Z 2025-05-07T20:25:58.3222569Z  2025-05-07T20:25:58.3222715Z 2025-05-07T20:25:58.3222721Z 2025-05-07T20:25:58.3222862Z  2025-05-07T20:25:58.3223017Z 2025-05-07T20:25:58.3223022Z 2025-05-07T20:25:58.3223028Z 2025-05-07T20:25:58.3223171Z  2025-05-07T20:25:58.3223327Z 2025-05-07T20:25:58.3223332Z 2025-05-07T20:25:58.3223337Z 2025-05-07T20:25:58.3223342Z 2025-05-07T20:25:58.3223483Z  2025-05-07T20:25:58.3223646Z 2025-05-07T20:25:58.3223651Z 2025-05-07T20:25:58.3223656Z 2025-05-07T20:25:58.3223661Z 2025-05-07T20:25:58.3223673Z 2025-05-07T20:25:58.3223824Z  2025-05-07T20:25:58.3223989Z 2025-05-07T20:25:58.3223994Z 2025-05-07T20:25:58.3223999Z 2025-05-07T20:25:58.3224004Z 2025-05-07T20:25:58.3224009Z 2025-05-07T20:25:58.3224014Z 2025-05-07T20:25:58.3224171Z  2025-05-07T20:25:58.3224342Z 2025-05-07T20:25:58.3224347Z 2025-05-07T20:25:58.3224352Z 2025-05-07T20:25:58.3224357Z 2025-05-07T20:25:58.3224362Z 2025-05-07T20:25:58.3224368Z 2025-05-07T20:25:58.3224373Z 2025-05-07T20:25:58.3224539Z  2025-05-07T20:25:58.3224728Z 2025-05-07T20:25:58.3224734Z 2025-05-07T20:25:58.3224739Z 2025-05-07T20:25:58.3224744Z 2025-05-07T20:25:58.3224750Z 2025-05-07T20:25:58.3224755Z 2025-05-07T20:25:58.3224760Z 2025-05-07T20:25:58.3224765Z 2025-05-07T20:25:58.3224937Z  2025-05-07T20:25:58.3225138Z 2025-05-07T20:25:58.3225289Z 2025-05-07T20:25:58.3225294Z 2025-05-07T20:25:58.3225299Z 2025-05-07T20:25:58.3225304Z 2025-05-07T20:25:58.3225309Z 2025-05-07T20:25:58.3225394Z 2025-05-07T20:25:58.3225400Z 2025-05-07T20:25:58.3225405Z 2025-05-07T20:25:58.3225586Z  2025-05-07T20:25:58.3225803Z 2025-05-07T20:25:58.3225809Z 2025-05-07T20:25:58.3225814Z 
2025-05-07T20:25:58.3225819Z 2025-05-07T20:25:58.3225824Z 2025-05-07T20:25:58.3225829Z 2025-05-07T20:25:58.3225834Z 2025-05-07T20:25:58.3225839Z 2025-05-07T20:25:58.3225852Z 2025-05-07T20:25:58.3225857Z 2025-05-07T20:25:58.3226034Z  2025-05-07T20:25:58.3226258Z 2025-05-07T20:25:58.3226263Z 2025-05-07T20:25:58.3226268Z 2025-05-07T20:25:58.3226273Z 2025-05-07T20:25:58.3226278Z 2025-05-07T20:25:58.3226290Z 2025-05-07T20:25:58.3226295Z 2025-05-07T20:25:58.3226300Z 2025-05-07T20:25:58.3226305Z 2025-05-07T20:25:58.3226310Z 2025-05-07T20:25:58.3226315Z 2025-05-07T20:25:58.3226489Z  2025-05-07T20:25:58.3226731Z 2025-05-07T20:25:58.3226746Z 2025-05-07T20:25:58.3226751Z 2025-05-07T20:25:58.3226756Z 2025-05-07T20:25:58.3226761Z 2025-05-07T20:25:58.3226773Z 2025-05-07T20:25:58.3226778Z 2025-05-07T20:25:58.3226783Z 2025-05-07T20:25:58.3226788Z 2025-05-07T20:25:58.3226794Z 2025-05-07T20:25:58.3226799Z 2025-05-07T20:25:58.3226804Z 2025-05-07T20:25:58.3226984Z  2025-05-07T20:25:58.3227238Z 2025-05-07T20:25:58.3227243Z 2025-05-07T20:25:58.3227248Z 2025-05-07T20:25:58.3227253Z 2025-05-07T20:25:58.3227258Z 2025-05-07T20:25:58.3227263Z 2025-05-07T20:25:58.3227268Z 2025-05-07T20:25:58.3227274Z 2025-05-07T20:25:58.3227279Z 2025-05-07T20:25:58.3227284Z 2025-05-07T20:25:58.3227289Z 2025-05-07T20:25:58.3227294Z 2025-05-07T20:25:58.3227299Z 2025-05-07T20:25:58.3227487Z  2025-05-07T20:25:58.3227745Z 2025-05-07T20:25:58.3227750Z 2025-05-07T20:25:58.3227755Z 2025-05-07T20:25:58.3227760Z 2025-05-07T20:25:58.3227772Z 2025-05-07T20:25:58.3227777Z 2025-05-07T20:25:58.3227782Z 2025-05-07T20:25:58.3227788Z 2025-05-07T20:25:58.3227792Z 2025-05-07T20:25:58.3227801Z 2025-05-07T20:25:58.3227804Z 2025-05-07T20:25:58.3227808Z 2025-05-07T20:25:58.3227812Z 2025-05-07T20:25:58.3227815Z 2025-05-07T20:25:58.3227967Z  2025-05-07T20:25:58.3228154Z 2025-05-07T20:25:58.3228158Z 2025-05-07T20:25:58.3228161Z 2025-05-07T20:25:58.3228165Z 2025-05-07T20:25:58.3228169Z 2025-05-07T20:25:58.3228173Z 2025-05-07T20:25:58.3228176Z 2025-05-07T20:25:58.3228180Z 2025-05-07T20:25:58.3228190Z 2025-05-07T20:25:58.3228194Z 2025-05-07T20:25:58.3228198Z 2025-05-07T20:25:58.3228201Z 2025-05-07T20:25:58.3228205Z 2025-05-07T20:25:58.3228208Z 2025-05-07T20:25:58.3228212Z 2025-05-07T20:25:58.3228382Z  2025-05-07T20:25:58.3228574Z 2025-05-07T20:25:58.3228578Z 2025-05-07T20:25:58.3228581Z 2025-05-07T20:25:58.3228585Z 2025-05-07T20:25:58.3228592Z 2025-05-07T20:25:58.3228595Z 2025-05-07T20:25:58.3228599Z 2025-05-07T20:25:58.3228602Z 2025-05-07T20:25:58.3228606Z 2025-05-07T20:25:58.3228614Z 2025-05-07T20:25:58.3228617Z 2025-05-07T20:25:58.3228626Z 2025-05-07T20:25:58.3228630Z 2025-05-07T20:25:58.3228633Z 2025-05-07T20:25:58.3228637Z 2025-05-07T20:25:58.3228640Z 2025-05-07T20:25:58.3228789Z  2025-05-07T20:25:58.3228986Z 2025-05-07T20:25:58.3228990Z 2025-05-07T20:25:58.3228994Z 2025-05-07T20:25:58.3229003Z 2025-05-07T20:25:58.3229006Z 2025-05-07T20:25:58.3229010Z 2025-05-07T20:25:58.3229013Z 2025-05-07T20:25:58.3229017Z 2025-05-07T20:25:58.3229021Z 2025-05-07T20:25:58.3229024Z 2025-05-07T20:25:58.3229028Z 2025-05-07T20:25:58.3229031Z 2025-05-07T20:25:58.3229035Z 2025-05-07T20:25:58.3229038Z 2025-05-07T20:25:58.3229042Z 2025-05-07T20:25:58.3229045Z 2025-05-07T20:25:58.3229049Z 2025-05-07T20:25:58.3229201Z  2025-05-07T20:25:58.3229506Z 2025-05-07T20:25:58.3229510Z 2025-05-07T20:25:58.3229513Z 2025-05-07T20:25:58.3229517Z 2025-05-07T20:25:58.3229521Z 2025-05-07T20:25:58.3229645Z 2025-05-07T20:25:58.3229649Z 2025-05-07T20:25:58.3229653Z 2025-05-07T20:25:58.3229656Z 
2025-05-07T20:25:58.3233262Z done
2025-05-07T20:25:58.6444232Z Preparing transaction: done
2025-05-07T20:26:00.0915543Z Verifying transaction: done
2025-05-07T20:26:00.9317202Z Executing transaction: done
2025-05-07T20:26:03.2862313Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:03.2862857Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:03.2863803Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:03.2877763Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
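
A note on the symlink step above: the CUDA 12.x conda packages appear to ship only the versioned libnvToolsExt.so.1, while some build setups still link against the unversioned name, so the setup script recreates it in both library directories. A minimal standalone sketch of the same fix, assuming the environment prefix seen in this log (ENV_PREFIX is a hypothetical variable; adjust for other setups):

  # Recreate the unversioned libnvToolsExt.so symlink in every lib dir
  # that carries the versioned soname.
  ENV_PREFIX="/home/ec2-user/miniconda/envs/build_binary"
  for libdir in "${ENV_PREFIX}/lib" "${ENV_PREFIX}/targets/x86_64-linux/lib"; do
    if [ -f "${libdir}/libnvToolsExt.so.1" ]; then
      ln -sf "${libdir}/libnvToolsExt.so.1" "${libdir}/libnvToolsExt.so"
    fi
  done
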
2025-05-07T20:26:03.2891924Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:03.2896173Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:03.4486728Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:03.4513043Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:03.4892201Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:05.3802701Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:05.4449141Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:05.8711925Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:05.9063305Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
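
For context on the [ENV] steps above: `conda env config vars set` stores a variable with the environment itself, so it is exported on every subsequent `conda activate` or `conda run`. The ERROR line appears benign here: the script prints the current value before setting it, and printenv exits non-zero when the variable is still unset. A short sketch of the pattern, assuming the same build_binary environment:

  # Show the current value (may fail harmlessly if unset, as in this log),
  # persist the variable in the env, then verify it in a fresh `conda run`.
  conda run -n build_binary printenv LD_LIBRARY_PATH || true
  conda env config vars set -n build_binary \
      LD_LIBRARY_PATH="/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs"
  conda run -n build_binary printenv LD_LIBRARY_PATH
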
2025-05-07T20:26:06.3421697Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:06.3422694Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:08.8036812Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:10.8270397Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:12.8506326Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:12.8507129Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:14.8775480Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:16.7792339Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:16.8427169Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:20.6893867Z /tmp/tmpxr5mqe6j: line 3: clang: command not found
2025-05-07T20:26:20.6894744Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:20.7525293Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:20.7546373Z total 36
2025-05-07T20:26:20.7546754Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:26:20.7547290Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:24 ..
2025-05-07T20:26:20.7547854Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:20.7548358Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:20.7549215Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:20.7549698Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:20.7550141Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:20.7550595Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:20.7551097Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:20.7551732Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:20.7574370Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:22.7159719Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:22.7160277Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:23.1471457Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:25.0446515Z -allow-unsupported-compiler
2025-05-07T20:26:25.1086677Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:25.1087455Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:25.1087876Z 2025-05-07T20:26:27.0726803Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:27.0727571Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:27.0727997Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:27.0728319Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:27.0728639Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:27.0728896Z #define _STL_PAIR_H 1 2025-05-07T20:26:27.0729174Z #define __cpp_attributes 200809L 2025-05-07T20:26:27.0729499Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:27.0729840Z #define __DELETE_THROW throw() 2025-05-07T20:26:27.0730098Z #define _PTRDIFF_T_ 2025-05-07T20:26:27.0730336Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:27.0730700Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:27.0731096Z #define _IO_LEFT 02 2025-05-07T20:26:27.0731426Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:27.0731689Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:27.0732062Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:27.0732660Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:27.0733254Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:27.0733548Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:27.0733805Z #define _IOS_OUTPUT 2 2025-05-07T20:26:27.0734101Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:27.0734458Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:27.0735091Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:27.0735364Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:27.0735888Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:27.0736987Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:27.0738151Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:27.0738575Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:27.0739009Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:27.0739485Z #define _T_WCHAR_ 2025-05-07T20:26:27.0739796Z #define stdout stdout 2025-05-07T20:26:27.0740458Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:27.0740979Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:27.0741333Z #define __flexarr [] 2025-05-07T20:26:27.0741673Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:27.0742114Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:27.0742635Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:27.0743055Z #define _MATH_H 1 2025-05-07T20:26:27.0743441Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:27.0743919Z #define __S64_TYPE long int 2025-05-07T20:26:27.0744265Z #define __stub_fchflags 2025-05-07T20:26:27.0744631Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:27.0745028Z #define __SQUAD_TYPE long int 2025-05-07T20:26:27.0745395Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:27.0745763Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:27.0746110Z #define NL_NMAX INT_MAX 2025-05-07T20:26:27.0746456Z #define _BITS_TIME_H 1 2025-05-07T20:26:27.0746843Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:27.0747278Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:27.0747709Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:27.0748200Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:27.0748749Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:27.0749250Z #define __CHAR_BIT__ 8 2025-05-07T20:26:27.0749606Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.0750038Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:27.0750433Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:27.0750798Z #define FP_NAN 0 2025-05-07T20:26:27.0751197Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:27.0751802Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:27.0752483Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:27.0752985Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:27.0753277Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:27.0753548Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:27.0753805Z #define __SM_80_RT_H__ 2025-05-07T20:26:27.0754026Z #define _NEW 2025-05-07T20:26:27.0754256Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:27.0754550Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:27.0754961Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:27.0755413Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:27.0755658Z #define __USE_ANSI 1 2025-05-07T20:26:27.0755948Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:27.0756460Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:27.0756825Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:27.0757137Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:27.0757517Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:27.0757896Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:27.0758284Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:27.0758634Z #define PIPE_BUF 4096 2025-05-07T20:26:27.0758962Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:27.0759321Z #define ADJ_TICK 0x4000 2025-05-07T20:26:27.0759606Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:27.0760081Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:27.0760342Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:27.0760746Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:27.0761198Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:27.0761725Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:27.0762091Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:27.0762340Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:27.0762618Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:27.0762904Z #define __cpp_static_assert 201411L 2025-05-07T20:26:27.0763241Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:27.0763583Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:27.0763864Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:27.0764148Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:27.0764445Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:27.0764733Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:27.0765033Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.0765389Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:27.0765736Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:27.0766023Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:27.0766330Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.0766691Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:27.0767046Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:27.0767334Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:27.0767631Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:27.0767963Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:27.0768287Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:27.0768680Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:27.0769095Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:27.0769402Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:27.0769663Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:27.0769952Z #define __GCC_IEC_559 2 2025-05-07T20:26:27.0770248Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:27.0770581Z #define _IO_flockfile(_fp) 2025-05-07T20:26:27.0770848Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:27.0771120Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:27.0771379Z #define _IOFBF 0 2025-05-07T20:26:27.0771601Z #define __USE_BSD 1 2025-05-07T20:26:27.0771832Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:27.0772104Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:27.0772374Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:27.0772632Z #define _IO_NO_WRITES 8 2025-05-07T20:26:27.0772887Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:27.0773235Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:27.0773583Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:27.0773897Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:27.0774207Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:27.0774505Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:27.0774778Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:27.0775042Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:27.0775355Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:27.0775735Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:27.0776097Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:27.0776400Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:27.0776708Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:27.0777039Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:27.0777336Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:27.0777636Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:27.0777913Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:27.0778176Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:27.0778938Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:27.0779783Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:27.0780298Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:27.0780614Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:27.0780912Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:27.0781216Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:27.0781506Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:27.0781818Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:27.0782148Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:27.0782441Z #define RAND_MAX 2147483647 2025-05-07T20:26:27.0782913Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:27.0794572Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.0794988Z #define __SM_90_RT_H__ 2025-05-07T20:26:27.0795246Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:27.0795538Z #define __COMPAR_FN_T 2025-05-07T20:26:27.0795779Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:27.0796059Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:27.0796538Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:27.0797055Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:27.0797406Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:27.0797774Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:27.0798073Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:27.0798416Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:27.0798741Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:27.0799258Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:27.0799799Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:27.0800138Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:27.0800425Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:27.0800722Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:27.0801034Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:27.0801308Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:27.0801578Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:27.0801850Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:27.0802109Z #define __u_char_defined 2025-05-07T20:26:27.0802427Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:27.0802791Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:27.0803058Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:27.0803310Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:27.0803599Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:27.0804230Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:27.0804660Z #define FP_INFINITE 1 2025-05-07T20:26:27.0805027Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:27.0805454Z #define _IO_pid_t __pid_t 2025-05-07T20:26:27.0805717Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:27.0805979Z #define __LEAF , __leaf__ 2025-05-07T20:26:27.0806389Z #define PATH_MAX 4096 2025-05-07T20:26:27.0806649Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:27.0806993Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:27.0807313Z #define _LIMITS_H___ 2025-05-07T20:26:27.0807538Z #define __size_t 2025-05-07T20:26:27.0807771Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:27.0808315Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:27.0808967Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:27.0809280Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:27.0809616Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:27.0809884Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:27.0810241Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:27.0811034Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:27.0811336Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:27.0811842Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:27.0812137Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:27.0812425Z #define __INT8_C(c) c 2025-05-07T20:26:27.0812683Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:27.0812988Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:27.0813255Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:27.0813510Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:27.0813765Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:27.0814045Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:27.0814370Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.0814693Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:27.0814970Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:27.0815246Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:27.0815509Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:27.0815836Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:27.0816141Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:27.0816507Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:27.0816890Z #define NFDBITS __NFDBITS 2025-05-07T20:26:27.0817154Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:27.0817442Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:27.0817766Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:27.0818085Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:27.0818344Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:27.0818638Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:27.0818944Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:27.0819265Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:27.0819679Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:27.0820174Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:27.0820468Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:27.0820789Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:27.0821166Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:27.0821504Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:27.0821819Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:27.0822149Z #define __daddr_t_defined 2025-05-07T20:26:27.0822406Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:27.0822678Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:27.0822997Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:27.0823506Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:27.0823985Z #define _ACRTIMP 2025-05-07T20:26:27.0824204Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:27.0824472Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:27.0824859Z #define _IOS_BIN 128 2025-05-07T20:26:27.0825339Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:27.0825889Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:26:27.0826162Z #define UNDERFLOW 4 2025-05-07T20:26:27.0826387Z #define NAME_MAX 255 2025-05-07T20:26:27.0826628Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:27.0826901Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:27.0827175Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:27.0827469Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:27.0827848Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:27.0828236Z #define __ptr_t void * 2025-05-07T20:26:27.0828471Z #define M_E 2.7182818284590452354 2025-05-07T20:26:27.0828756Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:27.0829027Z #define __USE_ISOCXX11 1 2025-05-07T20:26:27.0829294Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:27.0829619Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:27.0829922Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:27.0830191Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:27.0830627Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:27.0830946Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:27.0831287Z #define __linux 1 2025-05-07T20:26:27.0831522Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:27.0831800Z #define cudaDeviceMask 0xff 2025-05-07T20:26:27.0832074Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:27.0832364Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:27.0832651Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:27.0832943Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:27.0833243Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:27.0833550Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:27.0833849Z #define _BITS_TYPES_H 1 2025-05-07T20:26:27.0834135Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:27.0834478Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:27.0834786Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:27.0835061Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:27.0835357Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:27.0835649Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:27.0836436Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:27.0837233Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:27.0837516Z #define __unix 1 2025-05-07T20:26:27.0837742Z #define MATH_ERRNO 1 2025-05-07T20:26:27.0837983Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:27.0838266Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:27.0838540Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:27.0838820Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:27.0839113Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:27.0839401Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:27.0840024Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:27.0840499Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:27.0840803Z #define CUDARTAPI_CDECL 2025-05-07T20:26:27.0841070Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:27.0841339Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:27.0841629Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:27.0841897Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:27.0842133Z #define __SIZE_T 2025-05-07T20:26:27.0842390Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:27.0842707Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:26:27.0843000Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:27.0843267Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:27.0843536Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:27.0843920Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:27.0844352Z #define __WAIT_STATUS void * 2025-05-07T20:26:27.0844620Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:27.0844893Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:27.0845162Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:27.0845449Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:27.0845732Z #define __WINT_MIN__ 0U 2025-05-07T20:26:27.0846305Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:27.0846946Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:27.0847247Z #define WUNTRACED 2 2025-05-07T20:26:27.0847476Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:27.0847758Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:27.0848045Z #define NZERO 20 2025-05-07T20:26:27.0848275Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:27.0848557Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:27.0848856Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:27.0849150Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:27.0849403Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:27.0849690Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:27.0850074Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:27.0850351Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:27.0850714Z #define EXIT_FAILURE 1 2025-05-07T20:26:27.0850962Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:27.0851223Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:27.0851498Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:27.0851757Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:27.0852040Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:27.0852383Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:27.0852760Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:27.0853062Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:27.0853321Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:27.0853599Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:27.0853892Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:27.0854201Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:27.0854494Z #define SEEK_DATA 3 2025-05-07T20:26:27.0854730Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:27.0855031Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:27.0855460Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:27.0855848Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:27.0856104Z #define __INT64_C(c) c ## L 2025-05-07T20:26:27.0856379Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:27.0856709Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:27.0857100Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:27.0857464Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:27.0857774Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:27.0858071Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:27.0858332Z #define __INT_WCHAR_T_H 2025-05-07T20:26:27.0858576Z #define WSTOPPED 2 2025-05-07T20:26:27.0858839Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:27.0859195Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:27.0859455Z #define FP_NORMAL 4 
2025-05-07T20:26:27.0859714Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:27.0860103Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:27.0860441Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:27.0860705Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:27.0860994Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:27.0861310Z #define cudaTextureType1D 0x01 2025-05-07T20:26:27.0861635Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:27.0861914Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:27.0862197Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:27.0862494Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:27.0862933Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:27.0863393Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:27.0863669Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:27.0863932Z #define _POSIX_SOURCE 1 2025-05-07T20:26:27.0864191Z #define cudaTextureType2D 0x02 2025-05-07T20:26:27.0864464Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:27.0864747Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:27.0865071Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:27.0865352Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:27.0865678Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:27.0866036Z #define cudaTextureType3D 0x03 2025-05-07T20:26:27.0866323Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:27.0866589Z #define CLOCK_REALTIME 0 2025-05-07T20:26:27.0866850Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:27.0867142Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:27.0867459Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:27.0867750Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:27.0868048Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:27.0868524Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:27.0868807Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:27.0869124Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:27.0869436Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:27.0869876Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:27.0870164Z #define __GLIBC__ 2 2025-05-07T20:26:27.0870486Z #define __END_DECLS } 2025-05-07T20:26:27.0870754Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:27.0871176Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:27.0871625Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:27.0871904Z #define WCONTINUED 8 2025-05-07T20:26:27.0872165Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:27.0872458Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:27.0872772Z #define _ALLOCA_H 1 2025-05-07T20:26:27.0873028Z #define __host__ __location__(host) 2025-05-07T20:26:27.0873525Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:27.0874051Z #define __SLONG32_TYPE int 2025-05-07T20:26:27.0874350Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:27.0874678Z #define _SYS_SELECT_H 1 2025-05-07T20:26:27.0874955Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:27.0875239Z #define _IOS_NOCREATE 32 2025-05-07T20:26:27.0875525Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:27.0875849Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:27.0876179Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:27.0876511Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:27.0876848Z #define __global__ __location__(global) 2025-05-07T20:26:27.0877176Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:27.0877469Z #define 
__cpp_decltype_auto 201304L 2025-05-07T20:26:27.0877771Z #define __DBL_DIG__ 15 2025-05-07T20:26:27.0878022Z #define TIME_UTC 1 2025-05-07T20:26:27.0878253Z #define __FLT32_DIG__ 6 2025-05-07T20:26:27.0878597Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:27.0879010Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:27.0879338Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:27.0879671Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:27.0879986Z #define _G_BUFSIZ 8192 2025-05-07T20:26:27.0880305Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:27.0880693Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:27.0881009Z #define __cudaCDP2GetDevice 2025-05-07T20:26:27.0881303Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:27.0881607Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:27.0881868Z #define __GXX_WEAK__ 1 2025-05-07T20:26:27.0882131Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:27.0882447Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:27.0882725Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:27.0883034Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:27.0883381Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:27.0883674Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:27.0883982Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:27.0884290Z #define _G_config_h 1 2025-05-07T20:26:27.0884586Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:27.0884935Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:27.0885229Z #define _GCC_WCHAR_T 2025-05-07T20:26:27.0885478Z #define TMP_MAX 238328 2025-05-07T20:26:27.0885739Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:27.0886016Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:27.0886292Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:27.0886588Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:27.0886875Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:27.0887177Z #define _IO_SKIPWS 01 2025-05-07T20:26:27.0887596Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:27.0888074Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:27.0888353Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:27.0888703Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:27.0889080Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:27.0889453Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:27.0890104Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:27.0890691Z #define le32toh(x) (x) 2025-05-07T20:26:27.0890932Z #define _SIZE_T_DEFINED 2025-05-07T20:26:27.0891328Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:27.0891685Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:27.0892041Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:27.0892450Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:27.0892874Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:27.0893156Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:27.0893429Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:27.0893712Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:27.0894006Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:27.0894541Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:27.0895057Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:27.0895383Z 
#define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:27.0895747Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:27.0896078Z #define _WCHAR_T_ 2025-05-07T20:26:27.0896329Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:27.0896708Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:27.0897100Z #define RTSIG_MAX 32 2025-05-07T20:26:27.0897340Z #define _STDDEF_H 2025-05-07T20:26:27.0897586Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:27.0897867Z #define _VA_LIST_DEFINED 2025-05-07T20:26:27.0898134Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:27.0898482Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:27.0898878Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:27.0899221Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:27.0899536Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:27.0900141Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:27.0900693Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:27.0901080Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:27.0901420Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:27.0901738Z #define __unix__ 1 2025-05-07T20:26:27.0901989Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:27.0902286Z #define __INT_WIDTH__ 32 2025-05-07T20:26:27.0902541Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:27.0902791Z #define _IONBF 2 2025-05-07T20:26:27.0903247Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:27.0904016Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:27.0904566Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:27.0904841Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:27.0905126Z #define __UINT16_C(c) c 2025-05-07T20:26:27.0905375Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:27.0905667Z #define STA_DEL 0x0020 2025-05-07T20:26:27.0905925Z #define __CUDACC_VER_MINOR__ 6 2025-05-07T20:26:27.0906188Z #define __id_t_defined 2025-05-07T20:26:27.0906476Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:27.0906935Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:27.0907373Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:27.0907651Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:27.0907925Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:27.0908185Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:27.0908468Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:27.0908748Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:27.0909020Z #define SING 2 2025-05-07T20:26:27.0909252Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:27.0909537Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:27.0909850Z #define cudaStreamDefault 0x00 2025-05-07T20:26:27.0910203Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:27.0910715Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:27.0911026Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:27.0911401Z #define __gnu_linux__ 1 2025-05-07T20:26:27.0911652Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:27.0911921Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:27.0912177Z #define MAX_INPUT 255 2025-05-07T20:26:27.0912434Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:27.0912775Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:27.0913153Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:27.0913483Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:27.0913820Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:27.0914238Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:27.0914665Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:27.0915028Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:27.0915400Z #define _Mfloat_ float 2025-05-07T20:26:27.0915678Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:27.0916004Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:27.0916313Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:27.0916814Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:27.0917321Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:27.0917614Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:27.0917956Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:27.0918322Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:27.0918631Z #define __USE_ISOC11 1 2025-05-07T20:26:27.0918877Z #define _BSD_SIZE_T_ 2025-05-07T20:26:27.0919117Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:27.0928457Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:27.0928795Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:27.0929115Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:27.0929450Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:27.0929773Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:27.0930113Z #define __THROW throw () 2025-05-07T20:26:27.0930387Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:27.0930681Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:27.0931044Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:27.0931404Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:27.0931691Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:27.0931959Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:27.0932233Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:27.0932503Z #define L_tmpnam 20 2025-05-07T20:26:27.0932736Z #define ___int_wchar_t_h 2025-05-07T20:26:27.0933090Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:27.0933484Z #define isascii(c) __isascii (c) 2025-05-07T20:26:27.0933748Z #define _T_PTRDIFF 2025-05-07T20:26:27.0934078Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:27.0934455Z #define toascii(c) __toascii (c) 2025-05-07T20:26:27.0934715Z #define __GNUC__ 11 2025-05-07T20:26:27.0934986Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:27.0935295Z #define __GXX_RTTI 1 2025-05-07T20:26:27.0935526Z #define __pie__ 2 2025-05-07T20:26:27.0935751Z #define __MMX__ 1 2025-05-07T20:26:27.0935986Z #define __cudaCDP2Malloc 2025-05-07T20:26:27.0936247Z #define __timespec_defined 1 2025-05-07T20:26:27.0936499Z #define L_ctermid 9 2025-05-07T20:26:27.0936740Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:27.0937056Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:27.0937445Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:27.0937824Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:27.0938100Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:27.0938395Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:27.0938706Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:27.0939024Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:27.0939614Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:27.0940379Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:27.0941147Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:27.0941755Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:27.0942065Z #define __USE_SVID 1 2025-05-07T20:26:27.0942324Z #define __constant__ __location__(constant) 2025-05-07T20:26:27.0942642Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:27.0942946Z #define __device__ __location__(device) 2025-05-07T20:26:27.0943269Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:27.0943598Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:27.0943869Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:27.0944147Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:27.0944504Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:27.0944882Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:27.0945164Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:27.0945545Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:27.0945931Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:27.0946195Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:27.0946555Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:27.0946979Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:27.0947299Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:27.0947568Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:27.0947838Z #define NGROUPS_MAX 65536 2025-05-07T20:26:27.0948097Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:27.0948359Z #define __USE_ISOC95 1 2025-05-07T20:26:27.0948591Z #define _TIME_H 1 2025-05-07T20:26:27.0948865Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:27.0949179Z #define __USE_ISOC99 1 2025-05-07T20:26:27.0949517Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:27.0949887Z #define HOST_NAME_MAX 64 2025-05-07T20:26:27.0950144Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:27.0950410Z #define _IOS_ATEND 4 2025-05-07T20:26:27.0950650Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:27.0950979Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:27.0951377Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:27.0951724Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:27.0952054Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:27.0952474Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:27.0952841Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:27.0953113Z #define _STDIO_H 1 2025-05-07T20:26:27.0953509Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:27.0953980Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:27.0954346Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:27.0954731Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:27.0955027Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:27.0955300Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:27.0955578Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:27.0955872Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:27.0956184Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:27.0956506Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:27.0956780Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:27.0957066Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:27.0957378Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:27.0957651Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:27.0957947Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:27.0958308Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:27.0958674Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:27.0959075Z #define __USE_XOPEN 1 2025-05-07T20:26:27.0959321Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:27.0959972Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:27.0960415Z #define __USE_XOPEN2K 1 2025-05-07T20:26:27.0960662Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:27.0960933Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:27.0961229Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:27.0961507Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:27.0962102Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:27.0962622Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:27.0962910Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:27.0963401Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:27.0963926Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:27.0964413Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:27.0964909Z #define __END_NAMESPACE_C99 2025-05-07T20:26:27.0965197Z #define __glibcxx_integral_traps true 2025-05-07T20:26:27.0965483Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:27.0965748Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:27.0966010Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:27.0966274Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:27.0966533Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:27.0966827Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:27.0967127Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:27.0967494Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:27.0967882Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:27.0968154Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:27.0968429Z #define _IO_UNITBUF 020000 2025-05-07T20:26:27.0968688Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:27.0968951Z #define __FD_SETSIZE 1024 2025-05-07T20:26:27.0969201Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:27.0969480Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:27.0969827Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:27.0970186Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:27.0970456Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:27.0970772Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:27.0971092Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:27.0971420Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:27.0971728Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:27.0972053Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:27.0972344Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:27.0972671Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:27.0972962Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:27.0973234Z #define __USE_POSIX199506 1 2025-05-07T20:26:27.0973484Z #define _FEATURES_H 1 2025-05-07T20:26:27.0973727Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:27.0974117Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:27.0974535Z #define __stub_getmsg 2025-05-07T20:26:27.0974774Z #define _IO_FIXED 010000 2025-05-07T20:26:27.0975050Z #define __cpp_lib_addressof_constexpr 201603 
2025-05-07T20:26:27.0975364Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:27.0975640Z #define __stub_setlogin 2025-05-07T20:26:27.0975882Z #define __stub_fattach 2025-05-07T20:26:27.0976127Z #define __cplusplus 201703L 2025-05-07T20:26:27.0976394Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:27.0976678Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:27.0976944Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:27.0977221Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:27.0977707Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:27.0978223Z #define _IO_INTERNAL 010 2025-05-07T20:26:27.0978480Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:27.0978820Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:27.0979169Z #define __dev_t_defined 2025-05-07T20:26:27.0979597Z #define __DEPRECATED 1 2025-05-07T20:26:27.0979972Z #define __S32_TYPE int 2025-05-07T20:26:27.0980371Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:27.0980682Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:27.0980948Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:27.0981202Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:27.0981809Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:27.0982441Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:27.0982761Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:27.0983106Z #define OVERFLOW 3 2025-05-07T20:26:27.0983358Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:27.0983673Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:27.0983959Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:27.0984301Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:27.0984631Z #define __SSE2_MATH__ 1 2025-05-07T20:26:27.0984883Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:27.0985197Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:27.0985510Z #define _IO_STDIO_H 2025-05-07T20:26:27.0985755Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:27.0986051Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:27.0986396Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:27.0986693Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:27.0987008Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:27.0987274Z #define __amd64 1 2025-05-07T20:26:27.0987498Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:27.0987773Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:27.0988057Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:27.0988344Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:27.0988654Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:27.0988931Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:27.0989226Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:27.0989499Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:27.0989753Z #define __bounded 2025-05-07T20:26:27.0990310Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:27.0990606Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:27.0990888Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:27.0991160Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:27.0991432Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.0991754Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:27.0992173Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:27.0992571Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:27.0992847Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:27.0993197Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:27.0993533Z #define STA_PLL 0x0001 2025-05-07T20:26:27.0993780Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:27.0994051Z #define __GNUG__ 11 2025-05-07T20:26:27.0994280Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:27.0994552Z #define _T_WCHAR 2025-05-07T20:26:27.0994794Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:27.0995084Z #define __specialization_static 2025-05-07T20:26:27.0995396Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:27.0995712Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:27.0995973Z #define cudaArraySparse 0x40 2025-05-07T20:26:27.0996233Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:27.0996488Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:27.0996776Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:27.0997072Z #define _WCHAR_T 2025-05-07T20:26:27.0997297Z #define __cudaCDP2Free 2025-05-07T20:26:27.0997932Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:27.0998868Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:27.0999291Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:27.1000029Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:27.1000435Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:27.1000700Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:27.1001036Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:27.1001388Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:27.1001627Z #define __NO_CTYPE 1 2025-05-07T20:26:27.1001860Z #define __stub_bdflush 2025-05-07T20:26:27.1002237Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:27.1002653Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:27.1002958Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:27.1003231Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:27.1003504Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:27.1003816Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:27.1004117Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:27.1004467Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:27.1004809Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:27.1005098Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:27.1005385Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:27.1005722Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:27.1006065Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:27.1006349Z #define _IO_STDIO 040000 2025-05-07T20:26:27.1006674Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:27.1007060Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:27.1007378Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:27.1007664Z #define _PTRDIFF_T 2025-05-07T20:26:27.1007885Z #define _MOVE_H 1 2025-05-07T20:26:27.1008111Z #define __cpp_hex_float 201603L 2025-05-07T20:26:27.1008373Z #define ADJ_TAI 0x0080 2025-05-07T20:26:27.1008595Z #define __ptrvalue 2025-05-07T20:26:27.1008826Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:27.1009084Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:27.1009364Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:27.1009673Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:27.1009931Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:27.1010212Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:27.1010606Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:27.1010985Z #define __USE_GNU 1 2025-05-07T20:26:27.1011214Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:27.1011493Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:27.1011770Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:27.1012150Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:27.1012535Z #define WEXITED 4 2025-05-07T20:26:27.1012757Z #define _IO_NO_READS 4 2025-05-07T20:26:27.1013060Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:27.1013402Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:27.1013688Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:27.1013990Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:27.1014306Z #define __uid_t_defined 2025-05-07T20:26:27.1014558Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:27.1014849Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:27.1015121Z #define WNOHANG 1 2025-05-07T20:26:27.1015371Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:27.1015677Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:27.1015951Z #define cudaEventDefault 0x00 2025-05-07T20:26:27.1016255Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:27.1016584Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:27.1016819Z #define __x86_64 1 2025-05-07T20:26:27.1017056Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:27.1017456Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:27.1017940Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:27.1018431Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:27.1018987Z #define __PTRDIFF_T 2025-05-07T20:26:27.1019401Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:27.1019781Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:27.1020194Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:27.1020492Z #define _Mlong_double_ long double 2025-05-07T20:26:27.1020770Z #define __cpp_lambdas 200907L 2025-05-07T20:26:27.1021027Z #define _IO_DEC 020 2025-05-07T20:26:27.1021263Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:27.1021537Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:27.1021820Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:27.1022106Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:27.1022374Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:27.1022673Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:27.1022997Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:27.1023273Z #define _ANSI_STDDEF_H 2025-05-07T20:26:27.1023551Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:27.1023868Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:27.1024240Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:27.1024619Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:27.1024904Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:27.1025200Z #define __cpp_template_auto 201606L 2025-05-07T20:26:27.1025554Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:27.1025923Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:27.1026198Z #define 
__key_t_defined 2025-05-07T20:26:27.1026452Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:27.1026815Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:27.1027279Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:27.1027644Z #define __GNUC_VA_LIST 2025-05-07T20:26:27.1027977Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:27.1028369Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:27.1028640Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:27.1028916Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:27.1029211Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:27.1029466Z #define __WCOREFLAG 0x80 2025-05-07T20:26:27.1029724Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:27.1030026Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:27.1030306Z #define __LP64__ 1 2025-05-07T20:26:27.1030560Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:27.1030874Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:27.1031161Z #define _IO_off64_t __off64_t 2025-05-07T20:26:27.1031431Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:27.1031692Z #define __time_t_defined 1 2025-05-07T20:26:27.1031949Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:27.1032298Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:27.1032664Z #define __USE_UNIX98 1 2025-05-07T20:26:27.1032912Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:27.1033199Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:27.1033467Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:27.1033770Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:27.1034085Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:27.1034349Z #define SEEK_CUR 1 2025-05-07T20:26:27.1034577Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:27.1034852Z #define _ASSERT_H 1 2025-05-07T20:26:27.1035420Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:27.1036044Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:27.1036324Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:27.1036580Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:27.1036843Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:27.1037125Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:27.1037615Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:27.1038024Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:27.1038796Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:27.1039454Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:27.1039748Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:27.1040093Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:27.1040474Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:27.1040747Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:27.1041026Z #define cudaArrayDefault 0x00 2025-05-07T20:26:27.1041310Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:27.1041606Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:27.1041884Z #define TLOSS 5 2025-05-07T20:26:27.1042107Z #define __ssize_t_defined 2025-05-07T20:26:27.1042371Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:27.1042648Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:27.1042944Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:27.1043245Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:27.1043607Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:27.1043986Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:27.1044272Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:27.1044570Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:27.1044882Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:27.1045180Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:27.1045470Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:27.1045731Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:27.1046068Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:27.1046428Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:27.1046674Z #define __cdecl 2025-05-07T20:26:27.1046934Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:27.1047280Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:27.1047614Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:27.1047867Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:27.1048144Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:27.1048439Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:27.1048705Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:27.1049018Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:27.1049356Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:27.1049758Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:27.1050192Z #define ADJ_NANO 0x2000 2025-05-07T20:26:27.1050503Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:27.1050861Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:27.1051152Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:27.1051418Z #define __FLT_DIG__ 6 2025-05-07T20:26:27.1060953Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:27.1061420Z #define __NO_INLINE__ 1 2025-05-07T20:26:27.1061756Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:27.1062126Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:27.1062432Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:27.1062807Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:27.1063139Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:27.1063415Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:27.1063727Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:27.1064026Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:27.1064416Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 
2025-05-07T20:26:27.1064832Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:27.1065266Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:27.1065624Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:27.1065870Z #define MAX_CANON 255 2025-05-07T20:26:27.1066403Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:27.1066692Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:27.1067085Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:27.1067441Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:27.1067754Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:27.1068054Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:27.1068336Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:27.1068681Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:27.1069075Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:27.1069341Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:27.1069641Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:27.1069940Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:27.1070222Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:27.1070540Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:27.1070843Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:27.1071104Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:27.1071378Z #define _SYS_TYPES_H 1 2025-05-07T20:26:27.1071627Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:27.1071896Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:27.1072148Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:27.1072387Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:27.1072661Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:27.1072963Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:27.1073220Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:27.1073510Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:27.1073788Z #define FP_SUBNORMAL 3 2025-05-07T20:26:27.1074048Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:27.1074326Z #define _INITIALIZER_LIST 2025-05-07T20:26:27.1074582Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:27.1074835Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:27.1075109Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:27.1075400Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:27.1075662Z #define _IO_file_flags _flags 2025-05-07T20:26:27.1075930Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:27.1076176Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:27.1076461Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:27.1076740Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:27.1077006Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:27.1077391Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:27.1077785Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:27.1078089Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:27.1078363Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:27.1078621Z #define _BSD_SOURCE 1 2025-05-07T20:26:27.1078857Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:27.1079725Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template> struct __has_ ##_NTYPE : false_type { }; template struct __has_ ##_NTYPE<_Tp, __void_t> : true_type { }; 2025-05-07T20:26:27.1080570Z #define __catch(X) catch(X) 2025-05-07T20:26:27.1080844Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:27.1081134Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:27.1081418Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:27.1081675Z #define __STRING(x) #x 2025-05-07T20:26:27.1081917Z #define 
__GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:27.1082197Z #define _T_PTRDIFF_ 2025-05-07T20:26:27.1082449Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:27.1082751Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:27.1083032Z #define __unbounded 2025-05-07T20:26:27.1083281Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:27.1083570Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:27.1083866Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:27.1084170Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:27.1084450Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:27.1084745Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:27.1085074Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:27.1085486Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:27.1085764Z #define __managed__ __location__(managed) 2025-05-07T20:26:27.1086140Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:27.1086550Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:27.1086967Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:27.1087230Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:27.1087606Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:27.1088009Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:27.1088258Z #define _SYS_SIZE_T_H 2025-05-07T20:26:27.1088554Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:27.1088896Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:27.1089174Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:27.1089470Z #define _CRTIMP 2025-05-07T20:26:27.1089698Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:27.1090360Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:27.1090779Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:27.1091147Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:27.1091554Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.1091874Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:27.1092159Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:27.1092444Z #define __SIZE_T__ 2025-05-07T20:26:27.1092665Z #define __stub_gtty 2025-05-07T20:26:27.1092897Z #define __pid_t_defined 2025-05-07T20:26:27.1093164Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:27.1093470Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:27.1095240Z #define __glibcxx_function_requires(...) 
2025-05-07T20:26:27.1095541Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:27.1095782Z #define __need_clockid_t 2025-05-07T20:26:27.1096032Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:27.1096295Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:27.1096617Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:27.1096940Z #define _IO_HEX 0100 2025-05-07T20:26:27.1097208Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:27.1097549Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:27.1097862Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:27.1098145Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:27.1098552Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:27.1098996Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:27.1099316Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:27.1099613Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:27.1099723Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:27.1099958Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:27.1100062Z #define __stub_sstk 2025-05-07T20:26:27.1100158Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:27.1100316Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:27.1100408Z #define __wur 2025-05-07T20:26:27.1100533Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:27.1100624Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:27.1100719Z #define _IO_OCT 040 2025-05-07T20:26:27.1100817Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:27.1100915Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:27.1101009Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:27.1101142Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:27.1101241Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:27.1101347Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:27.1101536Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:27.1101639Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:27.1101729Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:27.1101838Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:27.1101937Z #define __off64_t_defined 2025-05-07T20:26:27.1102039Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:27.1102130Z #define __FLT128_DIG__ 33 2025-05-07T20:26:27.1102478Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:27.1102581Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:27.1102676Z #define __INT32_C(c) c 2025-05-07T20:26:27.1102892Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:27.1102995Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:27.1103100Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:27.1103197Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:27.1103287Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:27.1103394Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:27.1103526Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:27.1103622Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:27.1103718Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:27.1103818Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:27.1103919Z #define __have_pthread_attr_t 1 2025-05-07T20:26:27.1104029Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:27.1104249Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:27.1104373Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:27.1104478Z #define __cudaCDP2EventRecord 2025-05-07T20:26:27.1104579Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:27.1104673Z #define 
htole32(x) (x) 2025-05-07T20:26:27.1104928Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:27.1105051Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:27.1105158Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:27.1105315Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:27.1105458Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:27.1105592Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:27.1105732Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:27.1105830Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:27.1105934Z #define cudaArrayLayered 0x01 2025-05-07T20:26:27.1106108Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:27.1106231Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:27.1106328Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:27.1106437Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:27.1106526Z #define unix 1 2025-05-07T20:26:27.1106623Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:27.1106718Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:27.1106823Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:27.1106946Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:27.1107040Z #define __USE_POSIX 1 2025-05-07T20:26:27.1107135Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:27.1107270Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:27.1107369Z #define __THROWNL throw () 2025-05-07T20:26:27.1107465Z #define __cpp_rtti 199711L 2025-05-07T20:26:27.1107572Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:27.1107670Z #define __PMT(args) args 2025-05-07T20:26:27.1107787Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1107939Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:27.1108065Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:27.1108158Z #define _SIZE_T_DECLARED 2025-05-07T20:26:27.1108263Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:27.1108365Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:27.1108758Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:27.1108867Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:27.1108962Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:27.1109060Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:27.1109209Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:27.1109292Z #define _WCHAR_T_H 2025-05-07T20:26:27.1109385Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:27.1109482Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:27.1109578Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:27.1109680Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:27.1109783Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:27.1109971Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:27.1110089Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:27.1110280Z #define __ELF__ 1 2025-05-07T20:26:27.1110385Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:27.1110491Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:27.1110581Z #define STA_INS 0x0010 2025-05-07T20:26:27.1110683Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:27.1110863Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:27.1110961Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:27.1111057Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:27.1111179Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:27.1111293Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:27.1111395Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:27.1111509Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:27.1111609Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:27.1111775Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:27.1111941Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:27.1112048Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:27.1112378Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:27.1112511Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:27.1112605Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:27.1112702Z #define __FLT_RADIX__ 2 2025-05-07T20:26:27.1112808Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:27.1112975Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:27.1113079Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:27.1113175Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:27.1113284Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:27.1113381Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:27.1113480Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:27.1113605Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:27.1113699Z #define WORD_BIT 32 2025-05-07T20:26:27.1113790Z #define _IO_USER_BUF 1 2025-05-07T20:26:27.1113887Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:27.1114005Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:27.1114120Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:27.1114223Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:27.1114329Z #define __long_double_t long double 2025-05-07T20:26:27.1114427Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:27.1114527Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:27.1114926Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:27.1115010Z #define __k8 1 2025-05-07T20:26:27.1115210Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:27.1115382Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:27.1115501Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:27.1115608Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:27.1115715Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:27.1115827Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:27.1115928Z #define __blksize_t_defined 2025-05-07T20:26:27.1116024Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:27.1116131Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:27.1116247Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:27.1116343Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:27.1116456Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:27.1116554Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:27.1116651Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:27.1116913Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:27.1117253Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:27.1117365Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:27.1117465Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:27.1117636Z #define SEEK_SET 0 2025-05-07T20:26:27.1117744Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:27.1117917Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:26:27.1118114Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:27.1118230Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:27.1118336Z #define __cudaCDP2GetLastError 2025-05-07T20:26:27.1118431Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:27.1118530Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:27.1118847Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:27.1118947Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:27.1119052Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:27.1119142Z #define __stub_sigreturn 2025-05-07T20:26:27.1119386Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:27.1119485Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:27.1119585Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:27.1119696Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:27.1119787Z #define CLOCK_TAI 11 2025-05-07T20:26:27.1119896Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:27.1119991Z #define __restrict_arr 2025-05-07T20:26:27.1120105Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:27.1120248Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:27.1120779Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:27.1120965Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:27.1121056Z #define __USE_MISC 1 2025-05-07T20:26:27.1121161Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:27.1121261Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:27.1121356Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:27.1121447Z #define __LDBL_DIG__ 18 2025-05-07T20:26:27.1121545Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:27.1121657Z #define __malloc_and_calloc_defined 2025-05-07T20:26:27.1121754Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:27.1121861Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:27.1121953Z #define __x86_64__ 1 2025-05-07T20:26:27.1122037Z #define _SIZE_T_ 2025-05-07T20:26:27.1122916Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:27.1123022Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:27.1123122Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:27.1123247Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:27.1123372Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:27.1123475Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:27.1123592Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:27.1123716Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:27.1123863Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:27.1123963Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:27.1124424Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
2025-05-07T20:26:27.1124560Z [elided] Preprocessor macro dump: the toolchain check printed several thousand #define lines (host libc/libstdc++ macros plus CUDA runtime macros), condensed here for readability. Key values recoverable from the dump: __NVCC__ 1, __CUDACC__ 1, CUDART_VERSION 12060, __CUDA_ARCH_LIST__ 520, _GLIBCXX_RELEASE 11, __GNUC_MINOR__ 4, __GNUC_PATCHLEVEL__ 0, __GLIBC_MINOR__ 17, __linux__ 1, __amd64__ 1.
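[NOTE] The exact command that produced the dump above is not echoed in this log; a plausible way to reproduce such a dump (an assumption, not the setup script's actual invocation) is to ask the compilers for their predefined macros:

    # Dump predefined macros for a trivial CUDA translation unit
    # (-Xcompiler forwards -dM to the host preprocessor; flag handling may vary by nvcc version)
    conda run -n build_binary nvcc -E -Xcompiler -dM -x cu /dev/null | sort
    # Host-compiler-only equivalent, for comparison:
    echo '' | g++ -dM -E -x c++ - | sort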
2025-05-07T20:26:27.1443031Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:29.0315368Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:29.0315757Z Copyright (c) 2005-2024 NVIDIA Corporation
2025-05-07T20:26:29.0316071Z Built on Tue_Oct_29_23:50:19_PDT_2024
2025-05-07T20:26:29.0316419Z Cuda compilation tools, release 12.6, V12.6.85
2025-05-07T20:26:29.0316752Z Build cuda_12.6.r12.6/compiler.35059454_0
2025-05-07T20:26:29.0960704Z /usr/bin/nvidia-smi
2025-05-07T20:26:29.0966449Z + nvidia-smi
2025-05-07T20:26:29.1145001Z Wed May 7 20:26:29 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                Persistence-M  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf          Pwr:Usage/Cap  |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
    |  0%   27C    P8             16W / 300W  |      0MiB / 23028MiB   |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:29.4050231Z [INSTALL] Successfully installed CUDA 12.6.3
2025-05-07T20:26:29.4103772Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:29.4104328Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:29.4116138Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:29.4116498Z env:
2025-05-07T20:26:29.4116721Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:29.4117014Z   BUILD_ENV: build_binary
2025-05-07T20:26:29.4117258Z   BUILD_TARGET: genai
2025-05-07T20:26:29.4117490Z   BUILD_VARIANT: cuda
2025-05-07T20:26:29.4117722Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:26:29.4117965Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:29.4118262Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:29.4118587Z ##[endgroup]
2025-05-07T20:26:29.7505546Z ################################################################################
2025-05-07T20:26:29.7505938Z # Install PyTorch (PIP)
2025-05-07T20:26:29.7506181Z #
2025-05-07T20:26:29.7520971Z # [2025-05-07T20:26:29.751Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:26:29.7521412Z ################################################################################
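[NOTE] A minimal sketch of what this step appears to do, reconstructed from the commands echoed below; this is an assumption, not the actual install_pytorch_pip source in .github/scripts/setup_env.bash:

    # Hypothetical reconstruction: install numpy via conda-forge, then the torch
    # nightly wheel from the CUDA-variant index (cuda/12.6.3 -> cu126).
    install_pytorch_pip () {
      local env_name="$1" channel="$2" variant="$3"   # e.g. build_binary nightly cuda/12.6.3
      local cu_short="cu$(echo "$variant" | cut -d/ -f2 | cut -d. -f1,2 | tr -d '.')"
      conda install -n "$env_name" -c conda-forge --override-channels -y numpy
      conda run -n "$env_name" pip install --pre torch \
        --index-url "https://download.pytorch.org/whl/${channel}/${cu_short}/"
    }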
2025-05-07T20:26:29.7550982Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:30.7518179Z Channels:
2025-05-07T20:26:30.7518436Z  - conda-forge
2025-05-07T20:26:30.7518673Z Platform: linux-64
2025-05-07T20:26:34.0088315Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:34.7266315Z Solving environment: done
2025-05-07T20:26:34.9420026Z ## Package Plan ##
2025-05-07T20:26:34.9420572Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:34.9421116Z   added / updated specs:
2025-05-07T20:26:34.9421447Z     - numpy
2025-05-07T20:26:34.9421814Z The following packages will be downloaded:
    package                    |            build
    ---------------------------|-----------------
    libblas-3.9.0              |31_h59b9bed_openblas      16 KB   conda-forge
    libcblas-3.9.0             |31_he106b2a_openblas      16 KB   conda-forge
    libgfortran-15.1.0         |      h69a702a_2          34 KB   conda-forge
    libgfortran5-15.1.0        |      hcea5267_2          1.5 MB  conda-forge
    liblapack-3.9.0            |31_h7ac8fdf_openblas      16 KB   conda-forge
    libopenblas-0.3.29         |pthreads_h94d23a6_0       5.6 MB  conda-forge
    numpy-2.2.5                |  py310hefbff90_0         7.6 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        14.8 MB
The following NEW packages will be INSTALLED:
    libblas       conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
    libcblas      conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
    libgfortran   conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
    libgfortran5  conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
    liblapack     conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
    libopenblas   conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
    numpy         conda-forge/linux-64::numpy-2.2.5-py310hefbff90_0
2025-05-07T20:26:34.9441495Z Downloading and Extracting Packages: ...working... done (interleaved per-package progress bars and ANSI cursor-control residue elided)
2025-05-07T20:26:35.9276257Z Preparing transaction: done
2025-05-07T20:26:36.1283616Z Verifying transaction: done
2025-05-07T20:26:36.2293180Z Executing transaction: done
2025-05-07T20:26:36.4080268Z ################################################################################
2025-05-07T20:26:36.4080672Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:36.4081359Z #
2025-05-07T20:26:36.4096695Z # [2025-05-07T20:26:36.409Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:36.4097165Z ################################################################################
2025-05-07T20:26:36.4112435Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:36.5081644Z [CHECK] Network does not appear to be blocked.
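[NOTE] The [EXEC] [ATTEMPT n/3] prefix suggests a retry wrapper around flaky commands such as this network probe; a minimal sketch under that assumption (the real helper lives in .github/scripts/setup_env.bash and may differ):

    # Hypothetical retry helper: run a command up to 3 times with exponential backoff
    exec_with_retries () {
      local max=3 attempt
      for attempt in $(seq 0 $((max - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep $((2 ** attempt))
      done
      echo "[ERROR] Command failed after ${max} attempts: $*" >&2
      return 1
    }

    # Usage, mirroring the probe above:
    exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null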
2025-05-07T20:26:36.5082011Z ################################################################################
2025-05-07T20:26:36.5082339Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:26:36.5082621Z #
2025-05-07T20:26:36.5102523Z # [2025-05-07T20:26:36.509Z] + __prepare_pip_arguments torch nightly cuda/12.6.3
2025-05-07T20:26:36.5102963Z ################################################################################
2025-05-07T20:26:36.5103184Z 
2025-05-07T20:26:36.5126231Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:26:36.5151357Z [INSTALL] Extracted package variant: cu126
2025-05-07T20:26:36.5167904Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:26:36.5168464Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:26:36.5176674Z [INSTALL] Extracted the full PIP package: --pre torch
2025-05-07T20:26:36.5185656Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ...
2025-05-07T20:26:36.5207425Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.1826773Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.1828657Z Collecting torch
2025-05-07T20:27:55.1829632Z   Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:27:55.1830386Z Collecting filelock (from torch)
2025-05-07T20:27:55.1830916Z   Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB)
2025-05-07T20:27:55.1831860Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from torch) (4.13.2)
2025-05-07T20:27:55.1832609Z Collecting sympy>=1.13.3 (from torch)
2025-05-07T20:27:55.1833111Z   Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB)
2025-05-07T20:27:55.1833951Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 45.5 MB/s eta 0:00:00
2025-05-07T20:27:55.1834331Z Collecting networkx (from torch)
2025-05-07T20:27:55.1834833Z   Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB)
2025-05-07T20:27:55.1835480Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 19.0 MB/s eta 0:00:00
2025-05-07T20:27:55.1835831Z Collecting jinja2 (from torch)
2025-05-07T20:27:55.1836315Z   Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB)
2025-05-07T20:27:55.1836818Z Collecting fsspec (from torch)
2025-05-07T20:27:55.1837315Z   Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB)
2025-05-07T20:27:55.1837875Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.1838582Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB)
2025-05-07T20:27:55.1839361Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 72.5 MB/s eta 0:00:00
2025-05-07T20:27:55.1839782Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.1840491Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB)
2025-05-07T20:27:55.1841266Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 10.9 MB/s eta 0:00:00
2025-05-07T20:27:55.1842508Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch)
2025-05-07T20:27:55.1843202Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB)
2025-05-07T20:27:55.1843959Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 46.0 MB/s eta 0:00:00
2025-05-07T20:27:55.1844350Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch)
2025-05-07T20:27:55.1845020Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB)
2025-05-07T20:27:55.1845768Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 34.5 MB/s eta 0:00:00
2025-05-07T20:27:55.1846353Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch)
2025-05-07T20:27:55.1847116Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB)
2025-05-07T20:27:55.1847954Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 67.4 MB/s eta 0:00:00
2025-05-07T20:27:55.1848330Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch)
2025-05-07T20:27:55.1848990Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB)
2025-05-07T20:27:55.1849746Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 165.0 MB/s eta 0:00:00
2025-05-07T20:27:55.1850118Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch)
2025-05-07T20:27:55.1850786Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB)
2025-05-07T20:27:55.1851556Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 225.8 MB/s eta 0:00:00
2025-05-07T20:27:55.1852077Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch)
2025-05-07T20:27:55.1852767Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB)
2025-05-07T20:27:55.1853558Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 145.7 MB/s eta 0:00:00
2025-05-07T20:27:55.1853946Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch)
2025-05-07T20:27:55.1854639Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB)
2025-05-07T20:27:55.1855398Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 141.7 MB/s eta 0:00:00
2025-05-07T20:27:55.1855788Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch)
2025-05-07T20:27:55.1856488Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:27:55.1857271Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 162.4 MB/s eta 0:00:00
2025-05-07T20:27:55.1857636Z Collecting nvidia-nccl-cu12==2.26.2 (from torch)
2025-05-07T20:27:55.1858393Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
2025-05-07T20:27:55.1859173Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.1859813Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB)
2025-05-07T20:27:55.1860603Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch)
2025-05-07T20:27:55.1861395Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB)
2025-05-07T20:27:55.1862258Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 157.7 MB/s eta 0:00:00
2025-05-07T20:27:55.1862649Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch)
2025-05-07T20:27:55.1863444Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:27:55.1864260Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch)
2025-05-07T20:27:55.1865219Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:27:55.1866489Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1)
2025-05-07T20:27:55.1867363Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
2025-05-07T20:27:55.1867971Z   Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB)
2025-05-07T20:27:55.1868703Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 49.9 MB/s eta 0:00:00
2025-05-07T20:27:55.1869073Z Collecting MarkupSafe>=2.0 (from jinja2->torch)
2025-05-07T20:27:55.1869775Z   Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
2025-05-07T20:27:55.1870823Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp310-cp310-manylinux_2_28_x86_64.whl (825.5 MB)
2025-05-07T20:27:55.1871642Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.5/825.5 MB 36.0 MB/s eta 0:00:00
2025-05-07T20:27:55.1872396Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB)
2025-05-07T20:27:55.1873236Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 11.7 MB/s eta 0:00:00
2025-05-07T20:27:55.1873988Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:27:55.1874844Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 106.3 MB/s eta 0:00:00
2025-05-07T20:27:55.1875625Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.4 MB)
2025-05-07T20:27:55.1876504Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.4/153.4 MB 134.0 MB/s eta 0:00:00
2025-05-07T20:27:55.1878356Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:27:55.1880051Z 
2025-05-07T20:27:55.1882132Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126
2025-05-07T20:27:55.1884886Z 
2025-05-07T20:27:57.4076770Z torch 2.8.0.dev20250507+cu126
2025-05-07T20:27:57.4079466Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126)
2025-05-07T20:28:00.8470418Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:28:04.3067849Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:04.3068383Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:28:07.6778546Z True
2025-05-07T20:28:07.6778799Z True
2025-05-07T20:28:07.6779230Z 
2025-05-07T20:28:07.7407410Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
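
[NOTE] The variant and ABI checks above reduce to a few lines of Python. The following is a sketch (not part of the harness), assuming it runs inside the build_binary environment:

    # Sketch of the post-install verification (assumes the build_binary env).
    import torch

    print(torch.__version__)                # expect 2.8.0.dev20250507+cu126
    assert torch.__version__.endswith("+cu126"), "wrong CUDA variant"
    print(torch.version.cuda)               # CUDA version torch was built with (12.6)
    print(torch.compiled_with_cxx11_abi())  # the _GLIBCXX_USE_CXX11_ABI probe
    import torch.distributed                # sub-package presence check
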
2025-05-07T20:28:07.7444110Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:07.7444713Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:07.7456314Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:07.7456655Z env:
2025-05-07T20:28:07.7456879Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:07.7457166Z   BUILD_ENV: build_binary
2025-05-07T20:28:07.7457407Z   BUILD_TARGET: genai
2025-05-07T20:28:07.7457632Z   BUILD_VARIANT: cuda
2025-05-07T20:28:07.7457865Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:28:07.7458110Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:07.7458408Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:07.7458732Z ##[endgroup]
2025-05-07T20:28:08.0835763Z /home/ec2-user/miniconda/bin/conda
2025-05-07T20:28:08.0837640Z ################################################################################
2025-05-07T20:28:08.0838132Z # Collect PyTorch Environment Information (for Reporting Issues)
2025-05-07T20:28:08.0838490Z #
2025-05-07T20:28:08.0853321Z # [2025-05-07T20:28:08.085Z] + collect_pytorch_env_info build_binary
2025-05-07T20:28:08.0853737Z ################################################################################
2025-05-07T20:28:08.0853946Z 
2025-05-07T20:28:08.0870462Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:08.1771116Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:08.1781285Z [INFO] Downloading the PyTorch environment info collection script ...
2025-05-07T20:28:08.1781891Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
2025-05-07T20:28:08.1782291Z 
2025-05-07T20:28:08.2674112Z 
2025-05-07T20:28:08.2674661Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ...
2025-05-07T20:28:08.2696081Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py
2025-05-07T20:28:14.2323609Z Collecting environment information...
2025-05-07T20:28:14.2323990Z PyTorch version: 2.8.0.dev20250507+cu126
2025-05-07T20:28:14.2324283Z Is debug build: False
2025-05-07T20:28:14.2324542Z CUDA used to build PyTorch: 12.6
2025-05-07T20:28:14.2324820Z ROCM used to build PyTorch: N/A
2025-05-07T20:28:14.2324993Z 
2025-05-07T20:28:14.2325098Z OS: Amazon Linux 2023.6.20250317 (x86_64)
2025-05-07T20:28:14.2325418Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:28:14.2325734Z Clang version: Could not collect
2025-05-07T20:28:14.2326002Z CMake version: Could not collect
2025-05-07T20:28:14.2326269Z Libc version: glibc-2.34
2025-05-07T20:28:14.2326432Z 
2025-05-07T20:28:14.2326735Z Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
2025-05-07T20:28:14.2327339Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34
2025-05-07T20:28:14.2327735Z Is CUDA available: True
2025-05-07T20:28:14.2327985Z CUDA runtime version: 12.6.85
2025-05-07T20:28:14.2328270Z CUDA_MODULE_LOADING set to: LAZY
2025-05-07T20:28:14.2328731Z GPU models and configuration: GPU 0: NVIDIA A10G
2025-05-07T20:28:14.2329455Z Nvidia driver version: 570.133.07
2025-05-07T20:28:14.2330035Z cuDNN version: Could not collect
2025-05-07T20:28:14.2330381Z HIP runtime version: N/A
2025-05-07T20:28:14.2330768Z MIOpen runtime version: N/A
2025-05-07T20:28:14.2331253Z Is XNNPACK available: True
2025-05-07T20:28:14.2331464Z 
2025-05-07T20:28:14.2331565Z CPU:
2025-05-07T20:28:14.2331898Z Architecture: x86_64
2025-05-07T20:28:14.2332408Z CPU op-mode(s): 32-bit, 64-bit
2025-05-07T20:28:14.2332907Z Address sizes: 48 bits physical, 48 bits virtual
2025-05-07T20:28:14.2333402Z Byte Order: Little Endian
2025-05-07T20:28:14.2333798Z CPU(s): 16
2025-05-07T20:28:14.2343517Z On-line CPU(s) list: 0-15
2025-05-07T20:28:14.2344129Z Vendor ID: AuthenticAMD
2025-05-07T20:28:14.2344489Z Model name: AMD EPYC 7R32
2025-05-07T20:28:14.2344811Z CPU family: 23
2025-05-07T20:28:14.2345103Z Model: 49
2025-05-07T20:28:14.2345392Z Thread(s) per core: 2
2025-05-07T20:28:14.2345689Z Core(s) per socket: 8
2025-05-07T20:28:14.2345965Z Socket(s): 1
2025-05-07T20:28:14.2346251Z Stepping: 0
2025-05-07T20:28:14.2346557Z BogoMIPS: 5599.62
2025-05-07T20:28:14.2348587Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:28:14.2350607Z Hypervisor vendor: KVM
2025-05-07T20:28:14.2350926Z Virtualization type: full
2025-05-07T20:28:14.2351262Z L1d cache: 256 KiB (8 instances)
2025-05-07T20:28:14.2351633Z L1i cache: 256 KiB (8 instances)
2025-05-07T20:28:14.2351997Z L2 cache: 4 MiB (8 instances)
2025-05-07T20:28:14.2352345Z L3 cache: 32 MiB (2 instances)
2025-05-07T20:28:14.2352669Z NUMA node(s): 1
2025-05-07T20:28:14.2352965Z NUMA node0 CPU(s): 0-15
2025-05-07T20:28:14.2353297Z Vulnerability Gather data sampling: Not affected
2025-05-07T20:28:14.2353679Z Vulnerability Itlb multihit: Not affected
2025-05-07T20:28:14.2354037Z Vulnerability L1tf: Not affected
2025-05-07T20:28:14.2354388Z Vulnerability Mds: Not affected
2025-05-07T20:28:14.2354744Z Vulnerability Meltdown: Not affected
2025-05-07T20:28:14.2355101Z Vulnerability Mmio stale data: Not affected
2025-05-07T20:28:14.2355467Z Vulnerability Reg file data sampling: Not affected
2025-05-07T20:28:14.2355998Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
2025-05-07T20:28:14.2356576Z Vulnerability Spec rstack overflow: Mitigation; safe RET
2025-05-07T20:28:14.2357111Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
2025-05-07T20:28:14.2357798Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
2025-05-07T20:28:14.2358647Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
2025-05-07T20:28:14.2359316Z Vulnerability Srbds: Not affected
2025-05-07T20:28:14.2359679Z Vulnerability Tsx async abort: Not affected
2025-05-07T20:28:14.2360004Z 
2025-05-07T20:28:14.2360108Z Versions of relevant libraries:
2025-05-07T20:28:14.2360378Z [pip3] numpy==2.2.5
2025-05-07T20:28:14.2360626Z [pip3] nvidia-cublas-cu12==12.6.4.1
2025-05-07T20:28:14.2360933Z [pip3] nvidia-cuda-cupti-cu12==12.6.80
2025-05-07T20:28:14.2361238Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77
2025-05-07T20:28:14.2361556Z [pip3] nvidia-cuda-runtime-cu12==12.6.77
2025-05-07T20:28:14.2361869Z [pip3] nvidia-cudnn-cu12==9.5.1.17
2025-05-07T20:28:14.2362148Z [pip3] nvidia-cufft-cu12==11.3.0.4
2025-05-07T20:28:14.2362438Z [pip3] nvidia-curand-cu12==10.3.7.77
2025-05-07T20:28:14.2362733Z [pip3] nvidia-cusolver-cu12==11.7.1.2
2025-05-07T20:28:14.2363030Z [pip3] nvidia-cusparse-cu12==12.5.4.2
2025-05-07T20:28:14.2363441Z [pip3] nvidia-cusparselt-cu12==0.6.3
2025-05-07T20:28:14.2363738Z [pip3] nvidia-nccl-cu12==2.26.2
2025-05-07T20:28:14.2364015Z [pip3] nvidia-nvjitlink-cu12==12.6.85
2025-05-07T20:28:14.2364312Z [pip3] nvidia-nvtx-cu12==12.6.77
2025-05-07T20:28:14.2364606Z [pip3] pytorch-triton==3.3.0+git96316ce5
2025-05-07T20:28:14.2364906Z [pip3] torch==2.8.0.dev20250507+cu126
2025-05-07T20:28:14.2365266Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge
2025-05-07T20:28:14.2365752Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge
2025-05-07T20:28:14.2366263Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge
2025-05-07T20:28:14.2366772Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge
2025-05-07T20:28:14.2367302Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge
2025-05-07T20:28:14.2367827Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge
2025-05-07T20:28:14.2368311Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2368776Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge
2025-05-07T20:28:14.2369266Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge
2025-05-07T20:28:14.2369758Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge
2025-05-07T20:28:14.2370225Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2370683Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge
2025-05-07T20:28:14.2371138Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2371590Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2372054Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge
2025-05-07T20:28:14.2372533Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge
2025-05-07T20:28:14.2372994Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge
2025-05-07T20:28:14.2373499Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge
2025-05-07T20:28:14.2373968Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2374423Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge
2025-05-07T20:28:14.2374878Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2375332Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge
2025-05-07T20:28:14.2375801Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge
2025-05-07T20:28:14.2376279Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge
2025-05-07T20:28:14.2376760Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2377230Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge
2025-05-07T20:28:14.2377711Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2378289Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge
2025-05-07T20:28:14.2378741Z [conda] numpy 2.2.5 py310hefbff90_0 conda-forge
2025-05-07T20:28:14.2379200Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi
2025-05-07T20:28:14.2379694Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi
2025-05-07T20:28:14.2380310Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi
2025-05-07T20:28:14.2380804Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi
2025-05-07T20:28:14.2381288Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi
2025-05-07T20:28:14.2381851Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi
2025-05-07T20:28:14.2382320Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi
2025-05-07T20:28:14.2382806Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi
2025-05-07T20:28:14.2383348Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi
2025-05-07T20:28:14.2383841Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi
2025-05-07T20:28:14.2384313Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi
2025-05-07T20:28:14.2384787Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi
2025-05-07T20:28:14.2385260Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi
2025-05-07T20:28:14.2385728Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi
2025-05-07T20:28:14.2386186Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi
2025-05-07T20:28:14.2386459Z 
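
[NOTE] The same report can be produced without downloading the script, since the installed torch wheel bundles a copy of collect_env; a minimal sketch:

    # Sketch: use the collect_env module bundled with the installed torch
    # (the harness instead fetches the latest script from pytorch/pytorch main).
    from torch.utils.collect_env import main
    main()  # prints the environment report to stdout
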
2025-05-07T20:28:14.3073678Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV
2025-05-07T20:28:14.3074355Z . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV
2025-05-07T20:28:14.3086962Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:14.3087305Z env:
2025-05-07T20:28:14.3087529Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:14.3087829Z   BUILD_ENV: build_binary
2025-05-07T20:28:14.3088063Z   BUILD_TARGET: genai
2025-05-07T20:28:14.3088295Z   BUILD_VARIANT: cuda
2025-05-07T20:28:14.3088543Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:28:14.3088796Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:14.3089101Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:14.3089434Z ##[endgroup]
2025-05-07T20:28:14.6485653Z ################################################################################
2025-05-07T20:28:14.6486182Z # Prepare FBGEMM-GPU Build
2025-05-07T20:28:14.6486501Z #
2025-05-07T20:28:14.6502076Z # [2025-05-07T20:28:14.649Z] + prepare_fbgemm_gpu_build build_binary
2025-05-07T20:28:14.6502640Z ################################################################################
2025-05-07T20:28:14.6502952Z 
2025-05-07T20:28:14.6517302Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:14.7456483Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:14.7479488Z [BUILD] Running git submodules update ...
2025-05-07T20:28:14.7500028Z [EXEC] [ATTEMPT 0/3] + git submodule sync
2025-05-07T20:28:14.7866440Z Synchronizing submodule url for '../external/asmjit'
2025-05-07T20:28:14.7867091Z Synchronizing submodule url for '../external/composable_kernel'
2025-05-07T20:28:14.7867658Z Synchronizing submodule url for '../external/cpuinfo'
2025-05-07T20:28:14.7868053Z Synchronizing submodule url for '../external/cutlass'
2025-05-07T20:28:14.7868475Z Synchronizing submodule url for '../external/googletest'
2025-05-07T20:28:14.7868938Z Synchronizing submodule url for '../external/hipify_torch'
2025-05-07T20:28:14.7869342Z Synchronizing submodule url for '../external/json'
2025-05-07T20:28:14.7902364Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive
2025-05-07T20:28:14.8458306Z [BUILD] Installing other build dependencies ...
2025-05-07T20:28:14.8480578Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt
2025-05-07T20:28:17.2596065Z Collecting backports.tarfile (from -r requirements.txt (line 13))
2025-05-07T20:28:17.2775086Z   Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB)
2025-05-07T20:28:17.3895540Z Collecting build (from -r requirements.txt (line 14))
2025-05-07T20:28:17.3965826Z   Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
2025-05-07T20:28:17.6359381Z Collecting cmake (from -r requirements.txt (line 15))
2025-05-07T20:28:17.6393607Z   Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB)
2025-05-07T20:28:17.7558592Z Collecting click (from -r requirements.txt (line 16))
2025-05-07T20:28:17.7784369Z   Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
2025-05-07T20:28:18.1262447Z Collecting hypothesis (from -r requirements.txt (line 17))
2025-05-07T20:28:18.1293733Z   Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB)
2025-05-07T20:28:18.1884194Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 18)) (3.1.4)
2025-05-07T20:28:18.1888138Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 19)) (1.3.0)
2025-05-07T20:28:18.2621380Z Collecting ninja (from -r requirements.txt (line 20))
2025-05-07T20:28:18.2650324Z   Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB)
2025-05-07T20:28:18.3144654Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 21)) (2.2.5)
2025-05-07T20:28:18.3807988Z Collecting pyre-extensions (from -r requirements.txt (line 22))
2025-05-07T20:28:18.3837477Z   Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB)
2025-05-07T20:28:18.5165771Z Collecting pyyaml (from -r requirements.txt (line 23))
2025-05-07T20:28:18.5199085Z   Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
2025-05-07T20:28:18.6320789Z Collecting scikit-build (from -r requirements.txt (line 24))
2025-05-07T20:28:18.6374154Z   Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB)
2025-05-07T20:28:18.7008615Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 25)) (78.1.1)
2025-05-07T20:28:18.7689751Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26))
2025-05-07T20:28:18.7719084Z   Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB)
2025-05-07T20:28:18.8769179Z Collecting tabulate (from -r requirements.txt (line 27))
2025-05-07T20:28:18.8798981Z   Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
2025-05-07T20:28:18.9907956Z Collecting patchelf (from -r requirements.txt (line 28))
2025-05-07T20:28:18.9937937Z   Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB)
2025-05-07T20:28:19.1075374Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14))
2025-05-07T20:28:19.1106815Z   Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB)
2025-05-07T20:28:19.2116168Z Collecting pyproject_hooks (from build->-r requirements.txt (line 14))
2025-05-07T20:28:19.2165605Z   Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB)
2025-05-07T20:28:19.3305232Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14))
2025-05-07T20:28:19.3337869Z   Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB)
2025-05-07T20:28:19.4508021Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17))
2025-05-07T20:28:19.4544239Z   Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB)
2025-05-07T20:28:19.5886356Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17))
2025-05-07T20:28:19.5914845Z   Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB)
2025-05-07T20:28:19.6845691Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17))
2025-05-07T20:28:19.6878713Z   Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB)
2025-05-07T20:28:19.7411464Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5)
2025-05-07T20:28:19.7935780Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22))
2025-05-07T20:28:19.7964666Z   Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
2025-05-07T20:28:19.8472049Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2)
2025-05-07T20:28:19.9008994Z Collecting distro (from scikit-build->-r requirements.txt (line 24))
2025-05-07T20:28:19.9037918Z   Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
2025-05-07T20:28:19.9536242Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1)
2025-05-07T20:28:20.0214596Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22))
2025-05-07T20:28:20.0247694Z   Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB)
2025-05-07T20:28:20.0786255Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB)
2025-05-07T20:28:20.1393356Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB)
2025-05-07T20:28:20.1978222Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB)
2025-05-07T20:28:20.7237050Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 53.0 MB/s eta 0:00:00
2025-05-07T20:28:20.7269768Z Downloading click-8.1.8-py3-none-any.whl (98 kB)
2025-05-07T20:28:20.7816264Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB)
2025-05-07T20:28:20.8447832Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
2025-05-07T20:28:20.8986631Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB)
2025-05-07T20:28:20.9656858Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB)
2025-05-07T20:28:21.0259712Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB)
2025-05-07T20:28:21.0855121Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.2/751.2 kB 8.7 MB/s eta 0:00:00
2025-05-07T20:28:21.0931798Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB)
2025-05-07T20:28:21.1402300Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB)
2025-05-07T20:28:21.1883009Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
2025-05-07T20:28:21.2405593Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB)
2025-05-07T20:28:21.2940095Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB)
2025-05-07T20:28:21.3425233Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB)
2025-05-07T20:28:21.4034092Z Downloading packaging-25.0-py3-none-any.whl (66 kB)
2025-05-07T20:28:21.4563298Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB)
2025-05-07T20:28:21.5039806Z Downloading distro-1.9.0-py3-none-any.whl (20 kB)
2025-05-07T20:28:21.5530277Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB)
2025-05-07T20:28:21.6026437Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)
2025-05-07T20:28:21.6549948Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB)
2025-05-07T20:28:21.8815883Z Installing collected packages: sortedcontainers, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions
2025-05-07T20:28:24.2732469Z 
2025-05-07T20:28:24.2804058Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0
2025-05-07T20:28:24.4702333Z ################################################################################
2025-05-07T20:28:24.4702739Z # Install PyTorch (PyTorch PIP)
2025-05-07T20:28:24.4703004Z #
2025-05-07T20:28:24.4719501Z # [2025-05-07T20:28:24.471Z] + install_triton_pip build_binary
2025-05-07T20:28:24.4719929Z ################################################################################
2025-05-07T20:28:24.4720164Z 
2025-05-07T20:28:24.4720385Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ...
2025-05-07T20:28:24.4720936Z ################################################################################
2025-05-07T20:28:24.4721291Z # Install Package From PyTorch PIP: pytorch-triton
2025-05-07T20:28:24.4721596Z #
2025-05-07T20:28:24.4736112Z # [2025-05-07T20:28:24.473Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8
2025-05-07T20:28:24.4736639Z ################################################################################
2025-05-07T20:28:24.4736847Z 
2025-05-07T20:28:24.4751974Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:24.5676916Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:24.5677594Z ################################################################################
2025-05-07T20:28:24.5677951Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:28:24.5679973Z #
2025-05-07T20:28:24.5696712Z # [2025-05-07T20:28:24.569Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8
2025-05-07T20:28:24.5697234Z ################################################################################
2025-05-07T20:28:24.5697453Z 
2025-05-07T20:28:24.5745161Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8)
2025-05-07T20:28:24.5761766Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:28:24.5762275Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/
2025-05-07T20:28:24.5771354Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8
2025-05-07T20:28:24.5781451Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ...
2025-05-07T20:28:24.5802993Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/
2025-05-07T20:28:32.3834495Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
2025-05-07T20:28:32.3835817Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible.
2025-05-07T20:28:32.3836538Z 
2025-05-07T20:28:32.3836756Z Looking in indexes: https://download.pytorch.org/whl/nightly/
2025-05-07T20:28:32.3837175Z Collecting pytorch-triton==3.2.0+git4b3bb1f8
2025-05-07T20:28:32.3837981Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB)
2025-05-07T20:28:32.3839281Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB)
2025-05-07T20:28:32.3840690Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 53.4 MB/s eta 0:00:00
2025-05-07T20:28:32.3841079Z Installing collected packages: pytorch-triton
2025-05-07T20:28:32.3841416Z   Attempting uninstall: pytorch-triton
2025-05-07T20:28:32.3841803Z     Found existing installation: pytorch-triton 3.3.0+git96316ce5
2025-05-07T20:28:32.3842222Z     Uninstalling pytorch-triton-3.3.0+git96316ce5:
2025-05-07T20:28:32.3842638Z       Successfully uninstalled pytorch-triton-3.3.0+git96316ce5
2025-05-07T20:28:32.3843069Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8
2025-05-07T20:28:32.3843331Z 
2025-05-07T20:28:34.5798963Z [CHECK] Python (sub-)package 'triton' found ...
2025-05-07T20:28:34.5802237Z [CHECK] Printing out the pytorch-triton version ...
2025-05-07T20:28:36.7275523Z ################################################################################
2025-05-07T20:28:36.7276141Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0
2025-05-07T20:28:36.7276543Z ################################################################################
2025-05-07T20:28:36.7276776Z 
2025-05-07T20:28:38.7717890Z [CHECK] Python (sub-)package 'numpy' found ...
2025-05-07T20:28:40.8825612Z [CHECK] Python (sub-)package 'skbuild' found ...
2025-05-07T20:28:40.8829416Z [BUILD] Successfully ran git submodules update
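
[NOTE] The resolver warning above is expected with a pinned pytorch-triton: the harness requests nightly/3.2.0+git4b3bb1f8 while this torch nightly declares pytorch-triton==3.3.0+git96316ce5 in its metadata, so pip installs the pin and reports the conflict. A sketch for surfacing such a mismatch programmatically:

    # Sketch: compare the installed pytorch-triton against what torch declares.
    from importlib.metadata import requires, version

    installed = version("pytorch-triton")
    declared = [r for r in (requires("torch") or []) if r.startswith("pytorch-triton")]
    print(installed)  # 3.2.0+git4b3bb1f8 after the pin above
    print(declared)   # ['pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" ...']
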
2025-05-07T20:28:40.8884449Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl
2025-05-07T20:28:40.8884934Z . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl
2025-05-07T20:28:40.8897103Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:40.8897444Z env:
2025-05-07T20:28:40.8897672Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:40.8898147Z   BUILD_ENV: build_binary
2025-05-07T20:28:40.8898396Z   BUILD_TARGET: genai
2025-05-07T20:28:40.8898624Z   BUILD_VARIANT: cuda
2025-05-07T20:28:40.8898857Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:28:40.8899107Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:40.8899407Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:40.8899742Z ##[endgroup]
2025-05-07T20:28:41.2259742Z ################################################################################
2025-05-07T20:28:41.2260235Z # Install FBGEMM-GPU from Wheel
2025-05-07T20:28:41.2260492Z #
2025-05-07T20:28:41.2277282Z # [2025-05-07T20:28:41.227Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2278265Z ################################################################################
2025-05-07T20:28:41.2278585Z 
2025-05-07T20:28:41.2279103Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2280134Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2280472Z 
2025-05-07T20:28:41.2396661Z 4d1609ed0721ee216ce1a19f96ff799eee4aae34  fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2399564Z 
2025-05-07T20:28:41.2400091Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2400596Z 
2025-05-07T20:28:41.2528644Z ad43f456d1673a9cf1f77f0929f0cfd284ec9b8069b0a67a8cf77246792fe8cf  fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2531349Z 
2025-05-07T20:28:41.2531801Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2532155Z 
2025-05-07T20:28:41.2756489Z c264a66986d7747c3b5c78c4d7455217  fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2759178Z 
2025-05-07T20:28:41.2768382Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl ...
2025-05-07T20:28:41.2790115Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:44.0016445Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:44.0017807Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5)
2025-05-07T20:28:44.0018628Z Installing collected packages: fbgemm-gpu-genai-nightly
2025-05-07T20:28:44.0019061Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7
2025-05-07T20:28:44.0019337Z 
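
[NOTE] The digests printed above can be re-checked independently before installing the wheel; a minimal sketch using hashlib:

    # Sketch: recompute the wheel digests printed above.
    import hashlib

    path = "fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl"
    with open(path, "rb") as f:
        data = f.read()
    print(hashlib.sha1(data).hexdigest())    # expect 4d1609ed0721ee216ce1a19f96ff799eee4aae34
    print(hashlib.sha256(data).hexdigest())  # expect ad43f456d1673a9cf1f77f0929f0cfd284ec9b8069b0a67a8cf77246792fe8cf
    print(hashlib.md5(data).hexdigest())     # expect c264a66986d7747c3b5c78c4d7455217
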
2025-05-07T20:28:50.9312878Z ################################################################################
2025-05-07T20:28:50.9313290Z [CHECK] !!!! INFO !!!!
2025-05-07T20:28:50.9313689Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:50.9314113Z [CHECK] CUDA version reported by PyTorch is: 12.6
2025-05-07T20:28:50.9314430Z [CHECK]
2025-05-07T20:28:50.9314749Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:28:50.9315241Z [CHECK] package channel; the package may be broken at runtime!!!
2025-05-07T20:28:50.9315667Z ################################################################################
2025-05-07T20:28:50.9315876Z 
2025-05-07T20:28:50.9316001Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:28:54.8472973Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:28:58.8027644Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:02.7558604Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:02.7562220Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:29:14.5305629Z ################################################################################
2025-05-07T20:29:14.5307756Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:29:14.5308198Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:29:14.5308729Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:29:14.5309082Z ################################################################################
2025-05-07T20:29:14.5309326Z 
2025-05-07T20:29:22.3655406Z ################################################################################
2025-05-07T20:29:22.3655855Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:29:22.3657255Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:29:22.3659013Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:29:22.3659537Z ################################################################################
2025-05-07T20:29:22.3659754Z 
2025-05-07T20:29:22.3660036Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:29:26.2904147Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:29:30.2002063Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:29:34.2596161Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:29:38.1899070Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
2025-05-07T20:29:38.1902463Z [INSTALL] Check for operator registrations ...
2025-05-07T20:29:42.0506369Z fbgemm.nccl_init
2025-05-07T20:29:42.0506627Z 
2025-05-07T20:29:42.1128861Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init
2025-05-07T20:29:45.9703944Z fbgemm.gqa_attn_splitk
2025-05-07T20:29:45.9704165Z 
2025-05-07T20:29:46.0334814Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk
2025-05-07T20:29:49.8871819Z fbgemm.rope_qkv_decoding
2025-05-07T20:29:49.8872113Z 
2025-05-07T20:29:49.9494921Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding
2025-05-07T20:29:49.9496215Z [INSTALL] FBGEMM-GPU installation through wheel completed ...
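
[NOTE] The registration probes above amount to importing fbgemm_gpu (which loads the native libraries as a side effect) and then resolving each operator on torch.ops.fbgemm; a sketch:

    # Sketch of the operator-registration probe.
    import torch
    import fbgemm_gpu  # noqa: F401  # side effect: loads and registers the ops

    for name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        # Attribute resolution raises if the operator is not registered.
        print(name, getattr(torch.ops.fbgemm, name))
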
2025-05-07T20:29:49.9536908Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:29:49.9537365Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:29:49.9553990Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:29:49.9554339Z env:
2025-05-07T20:29:49.9554563Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:29:49.9554859Z   BUILD_ENV: build_binary
2025-05-07T20:29:49.9555106Z   BUILD_TARGET: genai
2025-05-07T20:29:49.9555337Z   BUILD_VARIANT: cuda
2025-05-07T20:29:49.9555581Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:29:49.9555832Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:29:49.9556130Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:29:49.9556459Z ##[endgroup]
2025-05-07T20:29:50.2941325Z ################################################################################
2025-05-07T20:29:50.2941724Z # Test All FBGEMM-GPU Modules
2025-05-07T20:29:50.2941984Z #
2025-05-07T20:29:50.2956970Z # [2025-05-07T20:29:50.295Z] + test_all_fbgemm_gpu_modules build_binary
2025-05-07T20:29:50.2957376Z ################################################################################
2025-05-07T20:29:50.2957590Z 
2025-05-07T20:29:58.1654001Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda)
2025-05-07T20:29:58.1654579Z [TEST] Will be running tests specific to this target and variant ...
2025-05-07T20:29:58.1654971Z [TEST] Determined the test directories:
2025-05-07T20:29:58.1655283Z   fbgemm_gpu/experimental/gen_ai/test
2025-05-07T20:29:58.1655587Z   fbgemm_gpu/experimental/example/test
2025-05-07T20:29:58.1655878Z   fbgemm_gpu/experimental/gemm/test
2025-05-07T20:29:58.1656068Z 
2025-05-07T20:29:58.1660941Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ...
2025-05-07T20:29:58.1670062Z [TEST] Set environment variables for CUDA testing ...
2025-05-07T20:29:58.1670525Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES
2025-05-07T20:29:58.1670804Z 
2025-05-07T20:29:58.5888056Z 
2025-05-07T20:29:58.5888360Z [TEST] Installing PyTest ...
2025-05-07T20:29:58.5913168Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:29:59.6898073Z Channels:
2025-05-07T20:29:59.6898334Z  - conda-forge
2025-05-07T20:29:59.6898566Z Platform: linux-64
2025-05-07T20:30:02.9768227Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:04.1237616Z Solving environment: done
2025-05-07T20:30:04.3486179Z 
2025-05-07T20:30:04.3486794Z ## Package Plan ##
2025-05-07T20:30:04.3486975Z 
2025-05-07T20:30:04.3487183Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:04.3487486Z 
2025-05-07T20:30:04.3487584Z   added / updated specs:
2025-05-07T20:30:04.3487832Z     - expecttest
2025-05-07T20:30:04.3488087Z     - pytest
2025-05-07T20:30:04.3488210Z 
2025-05-07T20:30:04.3488336Z The following packages will be downloaded:
2025-05-07T20:30:04.3488565Z 
2025-05-07T20:30:04.3488680Z     package                    |            build
2025-05-07T20:30:04.3489000Z     ---------------------------|-----------------
2025-05-07T20:30:04.3489366Z     colorama-0.4.6             |   pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:30:04.3490075Z     exceptiongroup-1.2.2       |   pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:30:04.3490565Z     expecttest-0.3.0           |   pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:30:04.3491002Z     iniconfig-2.0.0            |   pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:30:04.3491426Z     packaging-25.0             |   pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:30:04.3491847Z     pluggy-1.5.0               |   pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:30:04.3492253Z     pytest-8.3.5               |   pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:30:04.3493202Z     tomli-2.2.1                |   pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:30:04.3493591Z     ------------------------------------------------------------
2025-05-07T20:30:04.3493935Z                                            Total:         428 KB
2025-05-07T20:30:04.3494140Z 
2025-05-07T20:30:04.3494277Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:04.3494491Z 
2025-05-07T20:30:04.3494693Z   colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:04.3495198Z   exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:04.3495719Z   expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:04.3496189Z   iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:04.3496645Z   packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:04.3497103Z   pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:04.3497532Z   pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:04.3497960Z   tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:04.3498214Z 
2025-05-07T20:30:04.3498367Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:30:04.7973328Z Preparing transaction: done
2025-05-07T20:30:04.8980265Z Verifying transaction: done
2025-05-07T20:30:06.7007509Z Executing transaction: done
2025-05-07T20:30:06.8297146Z [TEST] Checking imports ...
2025-05-07T20:30:10.7292616Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:10.7307265Z [TEST] Setting feature flags ...
2025-05-07T20:30:10.7307890Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1
2025-05-07T20:30:10.7308288Z 
2025-05-07T20:30:11.1571652Z 
2025-05-07T20:30:11.1572160Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning
2025-05-07T20:30:11.1583728Z ################################################################################
2025-05-07T20:30:11.1584142Z # Run FBGEMM-GPU Tests:
2025-05-07T20:30:11.1584398Z #
2025-05-07T20:30:11.1591483Z # [2025-05-07T20:30:11.158Z] + __run_fbgemm_gpu_tests_in_directory build_binary
2025-05-07T20:30:11.1591914Z ################################################################################
2025-05-07T20:30:11.1592139Z 
2025-05-07T20:30:11.1598592Z [TEST] Enumerating ALL test files ...
2025-05-07T20:30:11.1629834Z ./attention/gqa_test.py
2025-05-07T20:30:11.1630116Z ./coalesce/coalesce_test.py
2025-05-07T20:30:11.1630383Z ./comm/multi_gpu_car_test.py
2025-05-07T20:30:11.1630664Z ./gather_scatter/gather_scatter_test.py
2025-05-07T20:30:11.1630967Z ./kv_cache/kv_cache_test.py
2025-05-07T20:30:11.1631219Z ./moe/activation_test.py
2025-05-07T20:30:11.1631475Z ./moe/gather_scatter_test.py
2025-05-07T20:30:11.1631731Z ./moe/layers_test.py
2025-05-07T20:30:11.1631967Z ./moe/shuffling_test.py
2025-05-07T20:30:11.1632209Z ./quantize/quantize_test.py
2025-05-07T20:30:11.1632389Z 
2025-05-07T20:30:11.1632506Z [TEST] Enumerating IGNORED test files ...
2025-05-07T20:30:11.1632714Z 
2025-05-07T20:30:11.1650130Z ################################################################################
2025-05-07T20:30:11.1665362Z # [2025-05-07T20:30:11.166Z] Run Python Test Suite:
2025-05-07T20:30:11.1665697Z #   ./attention/gqa_test.py
2025-05-07T20:30:11.1665979Z ################################################################################
2025-05-07T20:30:11.1689677Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py
2025-05-07T20:30:11.1690512Z 
2025-05-07T20:30:13.6971393Z ============================= test session starts ==============================
2025-05-07T20:30:13.6972057Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:30:13.6972590Z cachedir: .pytest_cache
2025-05-07T20:30:13.6973708Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:30:13.6974432Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:30:13.6974841Z plugins: hypothesis-6.131.14
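
[NOTE] The hypothesis profile 'ci' reported in the session header above corresponds to a registered Hypothesis settings profile, roughly as follows (a sketch; FBGEMM's conftest may differ in detail):

    # Sketch: a Hypothesis settings profile matching the session header above.
    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,
        deadline=None,
        print_blob=True,
        derandomize=True,
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")
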
collected 2 items 2025-05-07T20:30:15.2174863Z 2025-05-07T20:30:52.0340631Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:30:52.0343376Z self=, 2025-05-07T20:30:52.0343800Z int4_kv=False, 2025-05-07T20:30:52.0344114Z num_groups=1, 2025-05-07T20:30:52.0344381Z B=1, 2025-05-07T20:30:52.0344619Z MAX_T=4, 2025-05-07T20:30:52.0344856Z N_H_L=1, 2025-05-07T20:30:52.0345105Z ) 2025-05-07T20:30:52.0345355Z Trying example: test_gqa( 2025-05-07T20:30:52.0345708Z self=, 2025-05-07T20:30:52.0346133Z int4_kv=True, 2025-05-07T20:30:52.0346400Z num_groups=1, 2025-05-07T20:30:52.0346649Z B=1, 2025-05-07T20:30:52.0346882Z MAX_T=4, 2025-05-07T20:30:52.0347139Z N_H_L=1, 2025-05-07T20:30:52.0347370Z ) 2025-05-07T20:30:52.0347617Z Trying example: test_gqa( 2025-05-07T20:30:52.0347979Z self=, 2025-05-07T20:30:52.0348352Z int4_kv=True, 2025-05-07T20:30:52.0348612Z num_groups=4, 2025-05-07T20:30:52.0348868Z B=23, 2025-05-07T20:30:52.0349110Z MAX_T=33, 2025-05-07T20:30:52.0349368Z N_H_L=68, 2025-05-07T20:30:52.0349625Z ) 2025-05-07T20:30:52.0349916Z Trying example: test_gqa( 2025-05-07T20:30:52.0350268Z self=, 2025-05-07T20:30:52.0350639Z int4_kv=True, 2025-05-07T20:30:52.0350894Z num_groups=4, 2025-05-07T20:30:52.0351145Z B=77, 2025-05-07T20:30:52.0351363Z MAX_T=4, 2025-05-07T20:30:52.0351611Z N_H_L=1, 2025-05-07T20:30:52.0351846Z ) 2025-05-07T20:30:52.0352073Z Trying example: test_gqa( 2025-05-07T20:30:52.0352431Z self=, 2025-05-07T20:30:52.0352812Z int4_kv=True, 2025-05-07T20:30:52.0353067Z num_groups=4, 2025-05-07T20:30:52.0353322Z B=77, 2025-05-07T20:30:52.0353547Z MAX_T=52, 2025-05-07T20:30:52.0353776Z N_H_L=67, 2025-05-07T20:30:52.0354012Z ) 2025-05-07T20:30:52.0354322Z Trying example: test_gqa( 2025-05-07T20:30:52.0354676Z self=, 2025-05-07T20:30:52.0355046Z int4_kv=False, 2025-05-07T20:30:52.0355307Z num_groups=4, 2025-05-07T20:30:52.0355555Z B=57, 2025-05-07T20:30:52.0355773Z MAX_T=45, 2025-05-07T20:30:52.0356015Z N_H_L=120, 2025-05-07T20:30:52.0356254Z ) 2025-05-07T20:30:52.0356483Z Trying example: test_gqa( 2025-05-07T20:30:52.0356833Z self=, 2025-05-07T20:30:52.0357211Z int4_kv=True, 2025-05-07T20:30:52.0357458Z num_groups=4, 2025-05-07T20:30:52.0357704Z B=52, 2025-05-07T20:30:52.0357939Z MAX_T=42, 2025-05-07T20:30:52.0358166Z N_H_L=53, 2025-05-07T20:30:52.0358397Z ) 2025-05-07T20:30:52.0358631Z Trying example: test_gqa( 2025-05-07T20:30:52.0358984Z self=, 2025-05-07T20:30:52.0359363Z int4_kv=True, 2025-05-07T20:30:52.0359617Z num_groups=1, 2025-05-07T20:30:52.0359857Z B=77, 2025-05-07T20:30:52.0360086Z MAX_T=95, 2025-05-07T20:30:52.0360322Z N_H_L=53, 2025-05-07T20:30:52.0360548Z ) 2025-05-07T20:30:52.0360786Z Trying example: test_gqa( 2025-05-07T20:30:52.0361137Z self=, 2025-05-07T20:30:52.0361518Z int4_kv=True, 2025-05-07T20:30:52.0361764Z num_groups=4, 2025-05-07T20:30:52.0362016Z B=113, 2025-05-07T20:30:52.0362244Z MAX_T=48, 2025-05-07T20:30:52.0362476Z N_H_L=96, 2025-05-07T20:30:52.0362709Z ) 2025-05-07T20:30:52.0362940Z Trying example: test_gqa( 2025-05-07T20:30:52.0363282Z self=, 2025-05-07T20:30:52.0364083Z int4_kv=False, 2025-05-07T20:30:52.0364345Z num_groups=1, 2025-05-07T20:30:52.0364586Z B=51, 2025-05-07T20:30:52.0365001Z MAX_T=61, 2025-05-07T20:30:52.0365249Z N_H_L=69, 2025-05-07T20:30:52.0365473Z ) 2025-05-07T20:30:52.0365708Z Trying example: test_gqa( 2025-05-07T20:30:52.0366056Z self=, 2025-05-07T20:30:52.0366427Z int4_kv=False, 2025-05-07T20:30:52.0366686Z num_groups=4, 2025-05-07T20:30:52.0366933Z B=17, 2025-05-07T20:30:52.0367155Z MAX_T=113, 
2025-05-07T20:30:52.0367399Z N_H_L=65, 2025-05-07T20:30:52.0367629Z ) 2025-05-07T20:30:52.0367856Z Trying example: test_gqa( 2025-05-07T20:30:52.0368205Z self=, 2025-05-07T20:30:52.0368582Z int4_kv=False, 2025-05-07T20:30:52.0368831Z num_groups=4, 2025-05-07T20:30:52.0369082Z B=17, 2025-05-07T20:30:52.0369310Z MAX_T=65, 2025-05-07T20:30:52.0369540Z N_H_L=65, 2025-05-07T20:30:52.0369777Z ) 2025-05-07T20:30:52.0370048Z Trying example: test_gqa( 2025-05-07T20:30:52.0370415Z self=, 2025-05-07T20:30:52.0370807Z int4_kv=False, 2025-05-07T20:30:52.0371064Z num_groups=4, 2025-05-07T20:30:52.0371309Z B=65, 2025-05-07T20:30:52.0371530Z MAX_T=65, 2025-05-07T20:30:52.0371767Z N_H_L=65, 2025-05-07T20:30:52.0371997Z ) 2025-05-07T20:30:52.0372224Z Trying example: test_gqa( 2025-05-07T20:30:52.0372577Z self=, 2025-05-07T20:30:52.0372954Z int4_kv=False, 2025-05-07T20:30:52.0373202Z num_groups=1, 2025-05-07T20:30:52.0373450Z B=6, 2025-05-07T20:30:52.0373678Z MAX_T=108, 2025-05-07T20:30:52.0373915Z N_H_L=14, 2025-05-07T20:30:52.0374149Z ) 2025-05-07T20:30:52.0374388Z Trying example: test_gqa( 2025-05-07T20:30:52.0374728Z self=, 2025-05-07T20:30:52.0375106Z int4_kv=False, 2025-05-07T20:30:52.0375364Z num_groups=1, 2025-05-07T20:30:52.0375611Z B=6, 2025-05-07T20:30:52.0375840Z MAX_T=14, 2025-05-07T20:30:52.0376081Z N_H_L=14, 2025-05-07T20:30:52.0376308Z ) 2025-05-07T20:30:52.0376553Z Trying example: test_gqa( 2025-05-07T20:30:52.0376905Z self=, 2025-05-07T20:30:52.0377277Z int4_kv=False, 2025-05-07T20:30:52.0377532Z num_groups=1, 2025-05-07T20:30:52.0377779Z B=6, 2025-05-07T20:30:52.0377997Z MAX_T=6, 2025-05-07T20:30:52.0378230Z N_H_L=14, 2025-05-07T20:30:52.0378462Z ) 2025-05-07T20:30:52.0378690Z Trying example: test_gqa( 2025-05-07T20:30:52.0379040Z self=, 2025-05-07T20:30:52.0379418Z int4_kv=False, 2025-05-07T20:30:52.0379675Z num_groups=1, 2025-05-07T20:30:52.0380052Z B=6, 2025-05-07T20:30:52.0380282Z MAX_T=6, 2025-05-07T20:30:52.0380515Z N_H_L=6, 2025-05-07T20:30:52.0380738Z ) 2025-05-07T20:30:52.0380974Z Trying example: test_gqa( 2025-05-07T20:30:52.0381321Z self=, 2025-05-07T20:30:52.0381703Z int4_kv=False, 2025-05-07T20:30:52.0381965Z num_groups=1, 2025-05-07T20:30:52.0382211Z B=70, 2025-05-07T20:30:52.0382441Z MAX_T=94, 2025-05-07T20:30:52.0382678Z N_H_L=78, 2025-05-07T20:30:52.0382912Z ) 2025-05-07T20:30:52.0383142Z Trying example: test_gqa( 2025-05-07T20:30:52.0383491Z self=, 2025-05-07T20:30:52.0383868Z int4_kv=False, 2025-05-07T20:30:52.0384120Z num_groups=1, 2025-05-07T20:30:52.0384367Z B=78, 2025-05-07T20:30:52.0384594Z MAX_T=94, 2025-05-07T20:30:52.0384825Z N_H_L=78, 2025-05-07T20:30:52.0385058Z ) 2025-05-07T20:30:52.0385291Z Trying example: test_gqa( 2025-05-07T20:30:52.0385632Z self=, 2025-05-07T20:30:52.0386011Z int4_kv=False, 2025-05-07T20:30:52.0386267Z num_groups=1, 2025-05-07T20:30:52.0386510Z B=94, 2025-05-07T20:30:52.0386740Z MAX_T=94, 2025-05-07T20:30:52.0386975Z N_H_L=78, 2025-05-07T20:30:52.0387965Z ) 2025-05-07T20:30:52.0388206Z Trying example: test_gqa( 2025-05-07T20:30:52.0388559Z self=, 2025-05-07T20:30:52.0389031Z int4_kv=False, 2025-05-07T20:30:52.0389289Z num_groups=1, 2025-05-07T20:30:52.0389550Z B=94, 2025-05-07T20:30:52.0389790Z MAX_T=94, 2025-05-07T20:30:52.0390247Z N_H_L=94, 2025-05-07T20:30:52.0390440Z ) 2025-05-07T20:30:52.0390633Z Trying example: test_gqa( 2025-05-07T20:30:52.0390919Z self=, 2025-05-07T20:30:52.0391228Z int4_kv=False, 2025-05-07T20:30:52.0391438Z num_groups=4, 2025-05-07T20:30:52.0391634Z B=41, 2025-05-07T20:30:52.0391822Z MAX_T=105, 
2025-05-07T20:30:52.0392023Z N_H_L=126, 2025-05-07T20:30:52.0392211Z ) 2025-05-07T20:30:52.0392410Z Trying example: test_gqa( 2025-05-07T20:30:52.0392697Z self=, 2025-05-07T20:30:52.0392998Z int4_kv=False, 2025-05-07T20:30:52.0393206Z num_groups=4, 2025-05-07T20:30:52.0393419Z B=105, 2025-05-07T20:30:52.0393600Z MAX_T=105, 2025-05-07T20:30:52.0393801Z N_H_L=126, 2025-05-07T20:30:52.0394000Z ) 2025-05-07T20:30:52.0394187Z Trying example: test_gqa( 2025-05-07T20:30:52.0394476Z self=, 2025-05-07T20:30:52.0394789Z int4_kv=False, 2025-05-07T20:30:52.0394990Z num_groups=4, 2025-05-07T20:30:52.0395192Z B=105, 2025-05-07T20:30:52.0395379Z MAX_T=105, 2025-05-07T20:30:52.0395574Z N_H_L=105, 2025-05-07T20:30:52.0395774Z ) 2025-05-07T20:30:52.0395970Z Trying example: test_gqa( 2025-05-07T20:30:52.0396253Z self=, 2025-05-07T20:30:52.0396562Z int4_kv=True, 2025-05-07T20:30:52.0396767Z num_groups=1, 2025-05-07T20:30:52.0396967Z B=95, 2025-05-07T20:30:52.0397148Z MAX_T=114, 2025-05-07T20:30:52.0397345Z N_H_L=43, 2025-05-07T20:30:52.0397532Z ) 2025-05-07T20:30:52.0397721Z Trying example: test_gqa( 2025-05-07T20:30:52.0398019Z self=, 2025-05-07T20:30:52.0398322Z int4_kv=True, 2025-05-07T20:30:52.0398530Z num_groups=1, 2025-05-07T20:30:52.0398739Z B=43, 2025-05-07T20:30:52.0398929Z MAX_T=114, 2025-05-07T20:30:52.0399123Z N_H_L=43, 2025-05-07T20:30:52.0399316Z ) 2025-05-07T20:30:52.0399513Z Trying example: test_gqa( 2025-05-07T20:30:52.0399802Z self=, 2025-05-07T20:30:52.0400111Z int4_kv=True, 2025-05-07T20:30:52.0400319Z num_groups=1, 2025-05-07T20:30:52.0400514Z B=43, 2025-05-07T20:30:52.0400705Z MAX_T=43, 2025-05-07T20:30:52.0400897Z N_H_L=43, 2025-05-07T20:30:52.0401083Z ) 2025-05-07T20:30:52.0401275Z Trying example: test_gqa( 2025-05-07T20:30:52.0401566Z self=, 2025-05-07T20:30:52.0401868Z int4_kv=False, 2025-05-07T20:30:52.0402076Z num_groups=1, 2025-05-07T20:30:52.0402278Z B=21, 2025-05-07T20:30:52.0402460Z MAX_T=38, 2025-05-07T20:30:52.0402658Z N_H_L=42, 2025-05-07T20:30:52.0402851Z ) 2025-05-07T20:30:52.0403049Z Trying example: test_gqa( 2025-05-07T20:30:52.0403339Z self=, 2025-05-07T20:30:52.0403655Z int4_kv=False, 2025-05-07T20:30:52.0403864Z num_groups=1, 2025-05-07T20:30:52.0404060Z B=38, 2025-05-07T20:30:52.0404246Z MAX_T=38, 2025-05-07T20:30:52.0404441Z N_H_L=42, 2025-05-07T20:30:52.0404622Z ) 2025-05-07T20:30:52.0404816Z Trying example: test_gqa( 2025-05-07T20:30:52.0405105Z self=, 2025-05-07T20:30:52.0405405Z int4_kv=False, 2025-05-07T20:30:52.0405614Z num_groups=1, 2025-05-07T20:30:52.0405816Z B=38, 2025-05-07T20:30:52.0405998Z MAX_T=42, 2025-05-07T20:30:52.0406194Z N_H_L=42, 2025-05-07T20:30:52.0406385Z ) 2025-05-07T20:30:52.0406571Z Trying example: test_gqa( 2025-05-07T20:30:52.0406871Z self=, 2025-05-07T20:30:52.0407186Z int4_kv=False, 2025-05-07T20:30:52.0407599Z num_groups=1, 2025-05-07T20:30:52.0407811Z B=42, 2025-05-07T20:30:52.0408004Z MAX_T=42, 2025-05-07T20:30:52.0408321Z N_H_L=42, 2025-05-07T20:30:52.0408523Z ) 2025-05-07T20:30:52.0408725Z Trying example: test_gqa( 2025-05-07T20:30:52.0409016Z self=, 2025-05-07T20:30:52.0409332Z int4_kv=True, 2025-05-07T20:30:52.0409550Z num_groups=1, 2025-05-07T20:30:52.0409760Z B=74, 2025-05-07T20:30:52.0409947Z MAX_T=20, 2025-05-07T20:30:52.0410147Z N_H_L=15, 2025-05-07T20:30:52.0410344Z ) 2025-05-07T20:30:52.0410539Z Trying example: test_gqa( 2025-05-07T20:30:52.0410837Z self=, 2025-05-07T20:30:52.0411153Z int4_kv=True, 2025-05-07T20:30:52.0411362Z num_groups=1, 2025-05-07T20:30:52.0411570Z B=20, 2025-05-07T20:30:52.0411763Z MAX_T=20, 
2025-05-07T20:30:52.0411955Z N_H_L=15, 2025-05-07T20:30:52.0412152Z ) 2025-05-07T20:30:52.0412348Z Trying example: test_gqa( 2025-05-07T20:30:52.0412646Z self=, 2025-05-07T20:30:52.0412958Z int4_kv=True, 2025-05-07T20:30:52.0413174Z num_groups=1, 2025-05-07T20:30:52.0413379Z B=20, 2025-05-07T20:30:52.0413571Z MAX_T=15, 2025-05-07T20:30:52.0413772Z N_H_L=15, 2025-05-07T20:30:52.0413961Z ) 2025-05-07T20:30:52.0414158Z Trying example: test_gqa( 2025-05-07T20:30:52.0414454Z self=, 2025-05-07T20:30:52.0414761Z int4_kv=True, 2025-05-07T20:30:52.0414974Z num_groups=1, 2025-05-07T20:30:52.0415181Z B=15, 2025-05-07T20:30:52.0415365Z MAX_T=20, 2025-05-07T20:30:52.0415564Z N_H_L=15, 2025-05-07T20:30:52.0415758Z ) 2025-05-07T20:30:52.0415951Z Trying example: test_gqa( 2025-05-07T20:30:52.0416251Z self=, 2025-05-07T20:30:52.0416562Z int4_kv=True, 2025-05-07T20:30:52.0416779Z num_groups=1, 2025-05-07T20:30:52.0416980Z B=15, 2025-05-07T20:30:52.0417180Z MAX_T=15, 2025-05-07T20:30:52.0417382Z N_H_L=15, 2025-05-07T20:30:52.0417572Z ) 2025-05-07T20:30:52.0417769Z Trying example: test_gqa( 2025-05-07T20:30:52.0418073Z self=, 2025-05-07T20:30:52.0418383Z int4_kv=False, 2025-05-07T20:30:52.0418600Z num_groups=4, 2025-05-07T20:30:52.0418811Z B=117, 2025-05-07T20:30:52.0419001Z MAX_T=104, 2025-05-07T20:30:52.0419213Z N_H_L=69, 2025-05-07T20:30:52.0419412Z ) 2025-05-07T20:30:52.0419604Z Trying example: test_gqa( 2025-05-07T20:30:52.0420045Z self=, 2025-05-07T20:30:52.0420400Z int4_kv=False, 2025-05-07T20:30:52.0420609Z num_groups=4, 2025-05-07T20:30:52.0420820Z B=117, 2025-05-07T20:30:52.0421017Z MAX_T=117, 2025-05-07T20:30:52.0421219Z N_H_L=69, 2025-05-07T20:30:52.0421416Z ) 2025-05-07T20:30:52.0421615Z Trying example: test_gqa( 2025-05-07T20:30:52.0421903Z self=, 2025-05-07T20:30:52.0422228Z int4_kv=False, 2025-05-07T20:30:52.0422442Z num_groups=4, 2025-05-07T20:30:52.0422643Z B=69, 2025-05-07T20:30:52.0422837Z MAX_T=117, 2025-05-07T20:30:52.0423045Z N_H_L=69, 2025-05-07T20:30:52.0423237Z ) 2025-05-07T20:30:52.0423434Z Trying example: test_gqa( 2025-05-07T20:30:52.0423725Z self=, 2025-05-07T20:30:52.0424031Z int4_kv=False, 2025-05-07T20:30:52.0424245Z num_groups=4, 2025-05-07T20:30:52.0424453Z B=117, 2025-05-07T20:30:52.0424645Z MAX_T=69, 2025-05-07T20:30:52.0424838Z N_H_L=69, 2025-05-07T20:30:52.0425037Z ) 2025-05-07T20:30:52.0425231Z PASSED 2025-05-07T20:30:52.0695293Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
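(An aside on the verbose "Trying example" listing above: each session header in this log reports a Hypothesis profile named 'ci' with database=None, deadline=None, print_blob=True, derandomize=True, and suppress_health_check=(HealthCheck.too_slow,). A minimal sketch of how such a profile could be registered follows; only the settings values are taken from this log, while the conftest.py location and profile wiring are assumptions.)

    # conftest.py -- hypothetical registration of the 'ci' Hypothesis profile
    # reported in the session headers of this log
    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,       # no example database, as reported in the session header
        deadline=None,       # no per-example deadline
        print_blob=True,     # print reproduction blobs on failure
        derandomize=True,    # deterministic example selection across runs
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")

(The per-example "Trying example" lines themselves come from verbosity=Verbosity.verbose in the tests' own @settings decorators, as shown in the test source printed later in this log.)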
2025-05-07T20:30:52.0695633Z 2025-05-07T20:30:52.0695785Z =========================== short test summary info ============================ 2025-05-07T20:30:52.0696496Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when CUDA is not available or xformers is not available 2025-05-07T20:30:52.0697505Z ======================== 1 passed, 1 skipped in 38.88s ========================= 2025-05-07T20:30:52.7073836Z 2025-05-07T20:30:52.7074656Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:30:52.7094563Z [TEST] Python test time for ./attention/gqa_test.py: 41 seconds 2025-05-07T20:30:52.7094866Z 2025-05-07T20:30:52.7095019Z 2025-05-07T20:30:52.7095025Z 2025-05-07T20:30:52.7095064Z 2025-05-07T20:30:52.7123368Z ################################################################################ 2025-05-07T20:30:52.7130959Z # [2025-05-07T20:30:52.712Z] Run Python Test Suite: 2025-05-07T20:30:52.7131308Z # ./coalesce/coalesce_test.py 2025-05-07T20:30:52.7131607Z ################################################################################ 2025-05-07T20:30:52.7157408Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:30:52.7158316Z 2025-05-07T20:30:54.8561674Z ============================= test session starts ============================== 2025-05-07T20:30:54.8562541Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:54.8563071Z cachedir: .pytest_cache 2025-05-07T20:30:54.8563646Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:54.8564354Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:54.8564771Z plugins: hypothesis-6.131.14 2025-05-07T20:30:56.4267188Z collecting ... 
collected 1 item 2025-05-07T20:30:56.4267539Z 2025-05-07T20:30:57.1638560Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:30:57.1639027Z 2025-05-07T20:30:57.1639228Z ============================== 1 passed in 2.43s =============================== 2025-05-07T20:30:57.7853545Z 2025-05-07T20:30:57.7854255Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:30:57.7873595Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:30:57.7874016Z 2025-05-07T20:30:57.7874023Z 2025-05-07T20:30:57.7874028Z 2025-05-07T20:30:57.7874033Z 2025-05-07T20:30:57.7895531Z ################################################################################ 2025-05-07T20:30:57.7911003Z # [2025-05-07T20:30:57.790Z] Run Python Test Suite: 2025-05-07T20:30:57.7911480Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:30:57.7911881Z ################################################################################ 2025-05-07T20:30:57.7935479Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:30:57.7936212Z 2025-05-07T20:30:59.9290802Z ============================= test session starts ============================== 2025-05-07T20:30:59.9291625Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:59.9292165Z cachedir: .pytest_cache 2025-05-07T20:30:59.9292738Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:59.9293457Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:59.9293870Z plugins: hypothesis-6.131.14 2025-05-07T20:31:01.5131479Z collecting ... 
collected 5 items 2025-05-07T20:31:01.5131770Z 2025-05-07T20:31:01.5142365Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:01.5150569Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:01.5158067Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:01.5165464Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:01.5181343Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:01.5181676Z 2025-05-07T20:31:01.5182031Z =========================== short test summary info ============================ 2025-05-07T20:31:01.5182701Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.5183633Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.5184541Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.5185449Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.5186353Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.5186997Z ============================== 5 skipped in 1.71s ============================== 2025-05-07T20:31:02.0714264Z 2025-05-07T20:31:02.0718874Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:02.0735948Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:31:02.0736241Z 2025-05-07T20:31:02.0736245Z 2025-05-07T20:31:02.0736261Z 2025-05-07T20:31:02.0736265Z 2025-05-07T20:31:02.0757797Z ################################################################################ 2025-05-07T20:31:02.0773944Z # [2025-05-07T20:31:02.077Z] Run Python Test Suite: 2025-05-07T20:31:02.0774288Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:02.0774607Z ################################################################################ 2025-05-07T20:31:02.0799181Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:02.0799977Z 2025-05-07T20:31:04.2262517Z ============================= test session starts ============================== 2025-05-07T20:31:04.2263183Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:04.2263715Z cachedir: .pytest_cache 2025-05-07T20:31:04.2264308Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:04.2265037Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:04.2265442Z plugins: hypothesis-6.131.14 2025-05-07T20:31:05.8815250Z collecting ... 
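(All five LLamaMultiGpuTests cases above are skipped with the same message. A sketch of the guard implied by that message follows; the decorator placement is an assumption, while the condition and skip text mirror the log.)

    # hypothetical guard consistent with the multi-GPU skip reason in this log
    import unittest

    import torch

    @unittest.skipIf(
        not torch.cuda.is_available() or torch.cuda.device_count() < 2,
        "Skip when CUDA is not available or when there are not enough GPUs; "
        "these tests require at least two GPUs",
    )
    class LLamaMultiGpuTests(unittest.TestCase):
        ...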
collected 2 items 2025-05-07T20:31:05.8815464Z 2025-05-07T20:31:05.8827233Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:05.8841796Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:05.8842219Z 2025-05-07T20:31:05.8842392Z =========================== short test summary info ============================ 2025-05-07T20:31:05.8843026Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:05.8843854Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:05.8844443Z ============================== 2 skipped in 1.78s ============================== 2025-05-07T20:31:06.4460044Z 2025-05-07T20:31:06.4461018Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:06.4480670Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds 2025-05-07T20:31:06.4480996Z 2025-05-07T20:31:06.4481000Z 2025-05-07T20:31:06.4481004Z 2025-05-07T20:31:06.4481405Z 2025-05-07T20:31:06.4505280Z ################################################################################ 2025-05-07T20:31:06.4521023Z # [2025-05-07T20:31:06.451Z] Run Python Test Suite: 2025-05-07T20:31:06.4521366Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:06.4521648Z ################################################################################ 2025-05-07T20:31:06.4546115Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:06.4546730Z 2025-05-07T20:31:08.5923822Z ============================= test session starts ============================== 2025-05-07T20:31:08.5924461Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:08.5925097Z cachedir: .pytest_cache 2025-05-07T20:31:08.5926275Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:08.5927722Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:08.5928526Z plugins: hypothesis-6.131.14 2025-05-07T20:31:10.1592638Z collecting ... collected 4 items 2025-05-07T20:31:10.1592852Z 2025-05-07T20:31:12.9870380Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
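(The gather_scatter cases above are skipped for lack of a Hopper GPU. A sketch of such a capability check follows; the helper name is hypothetical, and it relies on Hopper-class parts such as the H100 reporting CUDA compute capability 9.x.)

    # hypothetical Hopper detection matching the gather_scatter skip reason
    import torch

    def has_hopper_gpu() -> bool:
        if not torch.cuda.is_available():
            return False
        major, _minor = torch.cuda.get_device_capability()
        return major == 9  # Hopper (sm_90) reports compute capability 9.x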
2025-05-07T20:31:13.0001564Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:13.0156042Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:13.0287462Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:13.0287818Z 2025-05-07T20:31:13.0287979Z =========================== short test summary info ============================ 2025-05-07T20:31:13.0288679Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:13.0289625Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when xformers is not available 2025-05-07T20:31:13.0290464Z ============================== 4 skipped in 4.56s ============================== 2025-05-07T20:31:14.9077500Z 2025-05-07T20:31:14.9078509Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:14.9097909Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:14.9098302Z 2025-05-07T20:31:14.9098313Z 2025-05-07T20:31:14.9098317Z 2025-05-07T20:31:14.9098323Z 2025-05-07T20:31:14.9119763Z ################################################################################ 2025-05-07T20:31:14.9135400Z # [2025-05-07T20:31:14.913Z] Run Python Test Suite: 2025-05-07T20:31:14.9135841Z # ./moe/activation_test.py 2025-05-07T20:31:14.9136213Z ################################################################################ 2025-05-07T20:31:14.9160556Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:14.9161170Z 2025-05-07T20:31:17.0756115Z ============================= test session starts ============================== 2025-05-07T20:31:17.0756758Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:17.0757289Z cachedir: .pytest_cache 2025-05-07T20:31:17.0758055Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:17.0759152Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:17.0759768Z plugins: hypothesis-6.131.14 2025-05-07T20:31:18.7121040Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:18.8897293Z collecting ... 
collected 2 items 2025-05-07T20:31:18.8897926Z 2025-05-07T20:31:24.3174316Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:24.3175382Z self=, 2025-05-07T20:31:24.3175780Z T=1, 2025-05-07T20:31:24.3175973Z D=5120, 2025-05-07T20:31:24.3176226Z contiguous=True, 2025-05-07T20:31:24.3176557Z compiled=True, 2025-05-07T20:31:24.3176833Z ) 2025-05-07T20:31:24.3177029Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3177477Z self=, 2025-05-07T20:31:24.3178020Z T=4096, 2025-05-07T20:31:24.3178295Z D=5120, 2025-05-07T20:31:24.3178584Z contiguous=True, 2025-05-07T20:31:24.3178893Z compiled=True, 2025-05-07T20:31:24.3179169Z ) 2025-05-07T20:31:24.3179436Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3180082Z self=, 2025-05-07T20:31:24.3180615Z T=4096, 2025-05-07T20:31:24.3180829Z D=7168, 2025-05-07T20:31:24.3181035Z contiguous=False, 2025-05-07T20:31:24.3181263Z compiled=False, 2025-05-07T20:31:24.3181472Z ) 2025-05-07T20:31:24.3181682Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3182059Z self=, 2025-05-07T20:31:24.3182437Z T=4096, 2025-05-07T20:31:24.3182629Z D=5120, 2025-05-07T20:31:24.3182827Z contiguous=False, 2025-05-07T20:31:24.3183053Z compiled=True, 2025-05-07T20:31:24.3183258Z ) 2025-05-07T20:31:24.3183456Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3183820Z self=, 2025-05-07T20:31:24.3184252Z T=1, 2025-05-07T20:31:24.3184528Z D=7168, 2025-05-07T20:31:24.3184801Z contiguous=True, 2025-05-07T20:31:24.3185110Z compiled=True, 2025-05-07T20:31:24.3185394Z ) 2025-05-07T20:31:24.3185621Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3185993Z self=, 2025-05-07T20:31:24.3186374Z T=1, 2025-05-07T20:31:24.3186550Z D=7168, 2025-05-07T20:31:24.3186753Z contiguous=False, 2025-05-07T20:31:24.3186984Z compiled=True, 2025-05-07T20:31:24.3187185Z ) 2025-05-07T20:31:24.3187388Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3187760Z self=, 2025-05-07T20:31:24.3188143Z T=4096, 2025-05-07T20:31:24.3188323Z D=5120, 2025-05-07T20:31:24.3188525Z contiguous=False, 2025-05-07T20:31:24.3188752Z compiled=False, 2025-05-07T20:31:24.3188951Z ) 2025-05-07T20:31:24.3189153Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3189528Z self=, 2025-05-07T20:31:24.3190329Z T=1, 2025-05-07T20:31:24.3190573Z D=7168, 2025-05-07T20:31:24.3190776Z contiguous=True, 2025-05-07T20:31:24.3190995Z compiled=False, 2025-05-07T20:31:24.3191204Z ) 2025-05-07T20:31:24.3191420Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3191789Z self=, 2025-05-07T20:31:24.3192175Z T=2048, 2025-05-07T20:31:24.3192370Z D=5120, 2025-05-07T20:31:24.3192561Z contiguous=True, 2025-05-07T20:31:24.3192787Z compiled=True, 2025-05-07T20:31:24.3192995Z ) 2025-05-07T20:31:24.3193193Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3193566Z self=, 2025-05-07T20:31:24.3193943Z T=2048, 2025-05-07T20:31:24.3194136Z D=7168, 2025-05-07T20:31:24.3194327Z contiguous=True, 2025-05-07T20:31:24.3194551Z compiled=True, 2025-05-07T20:31:24.3194754Z ) 2025-05-07T20:31:24.3194952Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3195323Z self=, 2025-05-07T20:31:24.3195694Z T=2048, 2025-05-07T20:31:24.3195882Z D=7168, 2025-05-07T20:31:24.3196083Z contiguous=True, 2025-05-07T20:31:24.3196546Z compiled=False, 2025-05-07T20:31:24.3196742Z ) 2025-05-07T20:31:24.3196941Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3197446Z self=, 2025-05-07T20:31:24.3197822Z T=128, 2025-05-07T20:31:24.3198006Z D=5120, 2025-05-07T20:31:24.3198205Z contiguous=False, 2025-05-07T20:31:24.3198428Z 
compiled=True, 2025-05-07T20:31:24.3198632Z ) 2025-05-07T20:31:24.3198835Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3199199Z self=, 2025-05-07T20:31:24.3199574Z T=128, 2025-05-07T20:31:24.3199760Z D=5120, 2025-05-07T20:31:24.3199953Z contiguous=True, 2025-05-07T20:31:24.3200180Z compiled=True, 2025-05-07T20:31:24.3200386Z ) 2025-05-07T20:31:24.3200582Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3200955Z self=, 2025-05-07T20:31:24.3201330Z T=16384, 2025-05-07T20:31:24.3201530Z D=5120, 2025-05-07T20:31:24.3201725Z contiguous=False, 2025-05-07T20:31:24.3201951Z compiled=True, 2025-05-07T20:31:24.3202160Z ) 2025-05-07T20:31:24.3202358Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3202732Z self=, 2025-05-07T20:31:24.3203108Z T=16384, 2025-05-07T20:31:24.3203298Z D=5120, 2025-05-07T20:31:24.3203496Z contiguous=False, 2025-05-07T20:31:24.3203725Z compiled=False, 2025-05-07T20:31:24.3203924Z ) 2025-05-07T20:31:24.3204133Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3204505Z self=, 2025-05-07T20:31:24.3204876Z T=128, 2025-05-07T20:31:24.3205068Z D=7168, 2025-05-07T20:31:24.3205268Z contiguous=True, 2025-05-07T20:31:24.3205487Z compiled=False, 2025-05-07T20:31:24.3205695Z ) 2025-05-07T20:31:24.3205897Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3206267Z self=, 2025-05-07T20:31:24.3206649Z T=128, 2025-05-07T20:31:24.3206839Z D=7168, 2025-05-07T20:31:24.3207047Z contiguous=False, 2025-05-07T20:31:24.3207273Z compiled=False, 2025-05-07T20:31:24.3207481Z ) 2025-05-07T20:31:24.3207684Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3208053Z self=, 2025-05-07T20:31:24.3208428Z T=1, 2025-05-07T20:31:24.3208616Z D=5120, 2025-05-07T20:31:24.3208810Z contiguous=False, 2025-05-07T20:31:24.3209039Z compiled=False, 2025-05-07T20:31:24.3209242Z ) 2025-05-07T20:31:24.3209442Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3209850Z self=, 2025-05-07T20:31:24.3210235Z T=1, 2025-05-07T20:31:24.3210425Z D=7168, 2025-05-07T20:31:24.3210625Z contiguous=False, 2025-05-07T20:31:24.3210848Z compiled=False, 2025-05-07T20:31:24.3211057Z ) 2025-05-07T20:31:24.3211265Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3211636Z self=, 2025-05-07T20:31:24.3212010Z T=4096, 2025-05-07T20:31:24.3212205Z D=5120, 2025-05-07T20:31:24.3212409Z contiguous=True, 2025-05-07T20:31:24.3212630Z compiled=False, 2025-05-07T20:31:24.3212839Z ) 2025-05-07T20:31:24.3213039Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3213405Z self=, 2025-05-07T20:31:24.3213787Z T=128, 2025-05-07T20:31:24.3213972Z D=7168, 2025-05-07T20:31:24.3214164Z contiguous=True, 2025-05-07T20:31:24.3214387Z compiled=True, 2025-05-07T20:31:24.3214592Z ) 2025-05-07T20:31:24.3214789Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3215158Z self=, 2025-05-07T20:31:24.3215538Z T=1, 2025-05-07T20:31:24.3215714Z D=5120, 2025-05-07T20:31:24.3215915Z contiguous=False, 2025-05-07T20:31:24.3216239Z compiled=True, 2025-05-07T20:31:24.3216435Z ) 2025-05-07T20:31:24.3216637Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3217098Z self=, 2025-05-07T20:31:24.3217473Z T=4096, 2025-05-07T20:31:24.3217669Z D=7168, 2025-05-07T20:31:24.3217872Z contiguous=True, 2025-05-07T20:31:24.3218099Z compiled=False, 2025-05-07T20:31:24.3218298Z ) 2025-05-07T20:31:24.3218504Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3218876Z self=, 2025-05-07T20:31:24.3219245Z T=4096, 2025-05-07T20:31:24.3219438Z D=7168, 2025-05-07T20:31:24.3219642Z contiguous=False, 2025-05-07T20:31:24.3219992Z compiled=True, 2025-05-07T20:31:24.3220267Z ) 
2025-05-07T20:31:24.3220543Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3221050Z self=, 2025-05-07T20:31:24.3221569Z T=128, 2025-05-07T20:31:24.3221831Z D=5120, 2025-05-07T20:31:24.3222087Z contiguous=True, 2025-05-07T20:31:24.3222602Z compiled=False, 2025-05-07T20:31:24.3222896Z ) 2025-05-07T20:31:24.3223124Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3223496Z self=, 2025-05-07T20:31:24.3223869Z T=128, 2025-05-07T20:31:24.3224049Z D=5120, 2025-05-07T20:31:24.3224251Z contiguous=False, 2025-05-07T20:31:24.3224478Z compiled=False, 2025-05-07T20:31:24.3224678Z ) 2025-05-07T20:31:24.3224882Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3225258Z self=, 2025-05-07T20:31:24.3225630Z T=1, 2025-05-07T20:31:24.3225806Z D=5120, 2025-05-07T20:31:24.3226008Z contiguous=True, 2025-05-07T20:31:24.3226234Z compiled=False, 2025-05-07T20:31:24.3226434Z ) 2025-05-07T20:31:24.3226636Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3227009Z self=, 2025-05-07T20:31:24.3227380Z T=2048, 2025-05-07T20:31:24.3227573Z D=7168, 2025-05-07T20:31:24.3227781Z contiguous=False, 2025-05-07T20:31:24.3228005Z compiled=True, 2025-05-07T20:31:24.3228224Z ) 2025-05-07T20:31:24.3228427Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3228799Z self=, 2025-05-07T20:31:24.3229172Z T=2048, 2025-05-07T20:31:24.3229367Z D=7168, 2025-05-07T20:31:24.3229561Z contiguous=False, 2025-05-07T20:31:24.3229789Z compiled=False, 2025-05-07T20:31:24.3230008Z ) 2025-05-07T20:31:24.3230203Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3230578Z self=, 2025-05-07T20:31:24.3230961Z T=16384, 2025-05-07T20:31:24.3231158Z D=7168, 2025-05-07T20:31:24.3231359Z contiguous=False, 2025-05-07T20:31:24.3231594Z compiled=True, 2025-05-07T20:31:24.3231804Z ) 2025-05-07T20:31:24.3232009Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3232385Z self=, 2025-05-07T20:31:24.3232764Z T=16384, 2025-05-07T20:31:24.3232957Z D=7168, 2025-05-07T20:31:24.3233305Z contiguous=True, 2025-05-07T20:31:24.3233536Z compiled=True, 2025-05-07T20:31:24.3233737Z ) 2025-05-07T20:31:24.3233939Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3234314Z self=, 2025-05-07T20:31:24.3234683Z T=4096, 2025-05-07T20:31:24.3234873Z D=7168, 2025-05-07T20:31:24.3235072Z contiguous=True, 2025-05-07T20:31:24.3235291Z compiled=True, 2025-05-07T20:31:24.3235496Z ) 2025-05-07T20:31:24.3235699Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3236063Z self=, 2025-05-07T20:31:24.3236436Z T=2048, 2025-05-07T20:31:24.3236626Z D=5120, 2025-05-07T20:31:24.3236821Z contiguous=False, 2025-05-07T20:31:24.3237148Z compiled=False, 2025-05-07T20:31:24.3237353Z ) 2025-05-07T20:31:24.3237550Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3238010Z self=, 2025-05-07T20:31:24.3238394Z T=2048, 2025-05-07T20:31:24.3238581Z D=5120, 2025-05-07T20:31:24.3238771Z contiguous=True, 2025-05-07T20:31:24.3238997Z compiled=False, 2025-05-07T20:31:24.3239204Z ) 2025-05-07T20:31:24.3239401Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3239772Z self=, 2025-05-07T20:31:24.3240145Z T=128, 2025-05-07T20:31:24.3240327Z D=7168, 2025-05-07T20:31:24.3240527Z contiguous=False, 2025-05-07T20:31:24.3240752Z compiled=True, 2025-05-07T20:31:24.3240948Z ) 2025-05-07T20:31:24.3241155Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3241544Z self=, 2025-05-07T20:31:24.3241920Z T=16384, 2025-05-07T20:31:24.3249610Z D=5120, 2025-05-07T20:31:24.3249888Z contiguous=True, 2025-05-07T20:31:24.3250157Z compiled=True, 2025-05-07T20:31:24.3250387Z ) 2025-05-07T20:31:24.3250599Z Trying example: 
test_silu_mul( 2025-05-07T20:31:24.3250990Z self=, 2025-05-07T20:31:24.3251382Z T=2048, 2025-05-07T20:31:24.3251584Z D=5120, 2025-05-07T20:31:24.3251788Z contiguous=False, 2025-05-07T20:31:24.3252026Z compiled=True, 2025-05-07T20:31:24.3252240Z ) 2025-05-07T20:31:24.3252441Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3252820Z self=, 2025-05-07T20:31:24.3253202Z T=16384, 2025-05-07T20:31:24.3253401Z D=5120, 2025-05-07T20:31:24.3253606Z contiguous=True, 2025-05-07T20:31:24.3253840Z compiled=False, 2025-05-07T20:31:24.3254052Z ) 2025-05-07T20:31:24.3254262Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3254652Z self=, 2025-05-07T20:31:24.3255028Z T=16384, 2025-05-07T20:31:24.3255229Z D=7168, 2025-05-07T20:31:24.3255439Z contiguous=False, 2025-05-07T20:31:24.3255672Z compiled=False, 2025-05-07T20:31:24.3255894Z ) 2025-05-07T20:31:24.3256101Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3256473Z self=, 2025-05-07T20:31:24.3256859Z T=16384, 2025-05-07T20:31:24.3257059Z D=7168, 2025-05-07T20:31:24.3257258Z contiguous=True, 2025-05-07T20:31:24.3257488Z compiled=False, 2025-05-07T20:31:24.3257703Z ) 2025-05-07T20:31:24.3257905Z PASSED 2025-05-07T20:31:24.3842346Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.3843447Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.3844824Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.3846247Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.3847605Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.3848968Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.3850746Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.3852111Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.3853529Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.3854760Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.3855969Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.3857167Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.3858191Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:24.3859197Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.3860746Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.3862562Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.3863670Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:24.3864713Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.3865878Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.3867217Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.3868258Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.3869169Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.3869911Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.3870922Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:24.4009372Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.4010430Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.4012893Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.4014316Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.4015678Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.4017043Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.4018335Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.4019707Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.4021911Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.4023209Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.4024414Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.4025627Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.4026652Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:24.4027656Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.4028854Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.4030112Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.4031219Z W0507 
20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:24.4032245Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.4033404Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.4034739Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.4035774Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.4036874Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.4037617Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.4038622Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:24.4420031Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.4421117Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.4422458Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.4423889Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.4425252Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.4426631Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.4427914Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.4429288Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.4430696Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.4431938Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.4433145Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.4434344Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.4435372Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:24.4436372Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.4437579Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.4438840Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.4440430Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:24.4441482Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.4442647Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.4443997Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.4445040Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.4445941Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.4446676Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.4447681Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:24.4462782Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.4463839Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.4465168Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.4466589Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.4467947Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.4469321Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.4470607Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.4471976Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.4473378Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.4474610Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.4475821Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.4477169Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.4478265Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:24.4479283Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.4480488Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.4481762Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.4482868Z W0507 
20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:24.4483906Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.4485066Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.4486403Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.4487452Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.4488349Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.4489089Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.4490386Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:24.8892400Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:24.8893208Z self=, 2025-05-07T20:31:24.8893634Z T=1, 2025-05-07T20:31:24.8893833Z D=5120, 2025-05-07T20:31:24.8894027Z scale_ub=None, 2025-05-07T20:31:24.8894251Z contiguous=True, 2025-05-07T20:31:24.8894484Z compiled=True, 2025-05-07T20:31:24.8894700Z ) 2025-05-07T20:31:24.8895034Z self = 2025-05-07T20:31:24.8895566Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:24.8895848Z 2025-05-07T20:31:24.8895933Z @given( 2025-05-07T20:31:24.8896172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:24.8896493Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:24.8896809Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:24.8897150Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:24.8897491Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:24.8897782Z ) 2025-05-07T20:31:24.8898138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:24.8898598Z def test_silu_mul_quant( 2025-05-07T20:31:24.8898848Z self, 2025-05-07T20:31:24.8899050Z T: int, 2025-05-07T20:31:24.8899250Z D: int, 2025-05-07T20:31:24.8899477Z scale_ub: Optional[float], 2025-05-07T20:31:24.8900316Z contiguous: bool, 2025-05-07T20:31:24.8900558Z compiled: bool, 2025-05-07T20:31:24.8900789Z ) -> None: 2025-05-07T20:31:24.8901155Z torch.manual_seed(2025) 2025-05-07T20:31:24.8901404Z 2025-05-07T20:31:24.8901694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:24.8902046Z 2025-05-07T20:31:24.8902239Z x_sign = torch.sign(x) 2025-05-07T20:31:24.8902541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:24.8902860Z x = x_sign * x_clamp 2025-05-07T20:31:24.8903104Z x0 = x[:, :D] 2025-05-07T20:31:24.8903331Z x1 = x[:, D:] 2025-05-07T20:31:24.8903544Z 2025-05-07T20:31:24.8903729Z if contiguous: 2025-05-07T20:31:24.8903971Z x0 = x0.contiguous() 
2025-05-07T20:31:24.8904239Z x1 = x1.contiguous() 2025-05-07T20:31:24.8904479Z 2025-05-07T20:31:24.8904678Z if scale_ub is not None: 2025-05-07T20:31:24.8904961Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:24.8905319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:24.8905631Z ) 2025-05-07T20:31:24.8905837Z else: 2025-05-07T20:31:24.8906059Z scale_ub_tensor = None 2025-05-07T20:31:24.8906317Z 2025-05-07T20:31:24.8906561Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:24.8906882Z op = silu_mul_quant 2025-05-07T20:31:24.8907137Z if compiled: 2025-05-07T20:31:24.8907397Z op = torch.compile(op) 2025-05-07T20:31:24.8907703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:24.8907978Z 2025-05-07T20:31:24.8908179Z y_fp8, y_scale = fn() 2025-05-07T20:31:24.8908481Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:24.8908772Z 2025-05-07T20:31:24.8909020Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:24.8909365Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:24.8909670Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:24.8909989Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:24.8910362Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:24.8910678Z 2025-05-07T20:31:24.8910879Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:24.8911084Z 2025-05-07T20:31:24.8911190Z moe/activation_test.py:126: 2025-05-07T20:31:24.8911498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:24.8911837Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:24.8912174Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:24.8912989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:24.8913765Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:24.8914323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:24.8915035Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:24.8915751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:24.8916498Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:24.8917268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:24.8918040Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:24.8918793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:24.8919445Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:24.8920107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:24.8920745Z fn() 2025-05-07T20:31:24.8921341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:24.8921932Z self.fn.run( 2025-05-07T20:31:24.8922415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:24.8922964Z kernel = self.compile( 2025-05-07T20:31:24.8923512Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:24.8924185Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.8924593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:24.8924828Z 2025-05-07T20:31:24.8925047Z self = 2025-05-07T20:31:24.8926159Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:24.8927717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7e568550>} 2025-05-07T20:31:24.8929238Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:24.8930364Z context = 2025-05-07T20:31:24.8930668Z 2025-05-07T20:31:24.8930841Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:24.8931384Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.8931874Z module_map=module_map) 2025-05-07T20:31:24.8932261Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.8932634Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:24.8932900Z E ^ 2025-05-07T20:31:24.8933381Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:24.8933850Z 2025-05-07T20:31:24.8934282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:24.8934812Z 2025-05-07T20:31:24.8934926Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:24.8935351Z self=, 2025-05-07T20:31:24.8935765Z T=2048, 2025-05-07T20:31:24.8935966Z D=5120, 2025-05-07T20:31:24.8936159Z scale_ub=1200.0, 2025-05-07T20:31:24.8936390Z contiguous=True, 2025-05-07T20:31:24.8936624Z compiled=False, 2025-05-07T20:31:24.8936829Z ) 2025-05-07T20:31:25.4284606Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:25.4286225Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:25.4288236Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:25.4290567Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:25.4292279Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:25.4293807Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:25.4295108Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:25.4296478Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:25.4297874Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:25.4299126Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:25.4300432Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:25.4301642Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:25.4302673Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:25.4303684Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:25.4304897Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:25.4306152Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:25.4307253Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:25.4308281Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:25.4309446Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:25.4310797Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:25.4311845Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:25.4312746Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:25.4313478Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:25.4314483Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.8330146Z self = 2025-05-07T20:31:26.8330802Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:26.8331230Z 2025-05-07T20:31:26.8331340Z @given( 2025-05-07T20:31:26.8331675Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:26.8332016Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:26.8332322Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:26.8332676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:26.8332999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:26.8333286Z ) 2025-05-07T20:31:26.8341191Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:26.8341684Z def test_silu_mul_quant( 2025-05-07T20:31:26.8341932Z self, 2025-05-07T20:31:26.8342139Z T: int, 2025-05-07T20:31:26.8342521Z D: int, 2025-05-07T20:31:26.8342784Z scale_ub: Optional[float], 2025-05-07T20:31:26.8343068Z contiguous: bool, 2025-05-07T20:31:26.8343310Z compiled: bool, 2025-05-07T20:31:26.8343916Z ) -> None: 2025-05-07T20:31:26.8344145Z torch.manual_seed(2025) 2025-05-07T20:31:26.8344392Z 2025-05-07T20:31:26.8344811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:26.8345169Z 2025-05-07T20:31:26.8345365Z x_sign = torch.sign(x) 2025-05-07T20:31:26.8345666Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:26.8345986Z x = x_sign * x_clamp 2025-05-07T20:31:26.8346230Z x0 = x[:, :D] 2025-05-07T20:31:26.8346455Z x1 = x[:, D:] 2025-05-07T20:31:26.8346672Z 2025-05-07T20:31:26.8346860Z if contiguous: 2025-05-07T20:31:26.8347100Z x0 = x0.contiguous() 2025-05-07T20:31:26.8347365Z x1 = x1.contiguous() 2025-05-07T20:31:26.8347604Z 2025-05-07T20:31:26.8347809Z if scale_ub is not None: 2025-05-07T20:31:26.8348085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:26.8348432Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:26.8348748Z ) 2025-05-07T20:31:26.8348950Z else: 2025-05-07T20:31:26.8349169Z scale_ub_tensor = None
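The repeated ValueError above comes from Triton rejecting fp8e4nv, the type that torch.float8_e4m3fn lowers to: it requires compute capability 8.9 or newer (Ada/Hopper), while the A10G on this linux.g5.4xlarge runner is SM 8.6 and only exposes fp8e4b15 and fp8e5. A minimal sketch of a capability check, with a hypothetical helper name (not part of the test shown here):

    import torch

    # Hypothetical guard: Triton fp8e4nv (torch.float8_e4m3fn) kernels
    # need SM 8.9+; the A10G driving this job is SM 8.6, so compilation
    # fails exactly as logged above.
    def device_supports_fp8e4nv() -> bool:
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )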
2025-05-07T20:31:26.8349427Z 2025-05-07T20:31:26.8349666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:26.8349986Z op = silu_mul_quant 2025-05-07T20:31:26.8350240Z if compiled: 2025-05-07T20:31:26.8350495Z op = torch.compile(op) 2025-05-07T20:31:26.8350799Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:26.8351072Z 2025-05-07T20:31:26.8351277Z > y_fp8, y_scale = fn() 2025-05-07T20:31:26.8351442Z 2025-05-07T20:31:26.8351558Z moe/activation_test.py:117: 2025-05-07T20:31:26.8351859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:26.8352196Z moe/activation_test.py:115: in fn 2025-05-07T20:31:26.8352560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:26.8353332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:26.8354043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:26.8354584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:26.8355270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:26.8355942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:26.8356470Z kernel = self.compile( 2025-05-07T20:31:26.8357019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:26.8357685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.8358081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:26.8358318Z 2025-05-07T20:31:26.8358534Z self = 2025-05-07T20:31:26.8359619Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:26.8361001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7f67b250>} 2025-05-07T20:31:26.8362340Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:26.8363370Z context = 2025-05-07T20:31:26.8363666Z 2025-05-07T20:31:26.8363834Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:26.8364465Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.8365016Z module_map=module_map) 2025-05-07T20:31:26.8365388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.8365757Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:26.8366026Z E ^ 2025-05-07T20:31:26.8366489Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.8366945Z 2025-05-07T20:31:26.8367366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:26.8367884Z 2025-05-07T20:31:26.8367996Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:26.8368413Z self=, 2025-05-07T20:31:26.8368819Z T=2048, 2025-05-07T20:31:26.8369018Z D=5120, 2025-05-07T20:31:26.8369227Z scale_ub=1200.0, 2025-05-07T20:31:26.8369453Z contiguous=True, 2025-05-07T20:31:26.8369688Z compiled=True, 2025-05-07T20:31:26.8369901Z ) 2025-05-07T20:31:26.8370222Z self = 2025-05-07T20:31:26.8370717Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:26.8385924Z >
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:26.8386124Z 2025-05-07T20:31:26.8386237Z moe/activation_test.py:126: 2025-05-07T20:31:26.8386532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:26.8386868Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:26.8387197Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:26.8387985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:26.8388741Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:26.8389294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:26.8390336Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:26.8391023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:26.8391740Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:26.8392489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:26.8393227Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:26.8393943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:26.8394590Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:26.8395188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:26.8395704Z fn() 2025-05-07T20:31:26.8396206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:26.8396782Z self.fn.run( 2025-05-07T20:31:26.8397257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:26.8397780Z kernel = self.compile( 2025-05-07T20:31:26.8398316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:26.8398973Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.8399367Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:26.8399598Z 2025-05-07T20:31:26.8399812Z self = 2025-05-07T20:31:26.8400884Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:26.8402263Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1c7f67a950>} 2025-05-07T20:31:26.8403593Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:26.8404625Z context = 2025-05-07T20:31:26.8405072Z 2025-05-07T20:31:26.8405241Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:26.8405896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.8406365Z module_map=module_map) 2025-05-07T20:31:26.8406723Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.8407075Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:26.8407338Z E ^ 2025-05-07T20:31:26.8407796Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.8408238Z 2025-05-07T20:31:26.8408649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:26.8409163Z 2025-05-07T20:31:26.8409266Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:26.8409674Z self=, 2025-05-07T20:31:26.8410071Z T=16384, 2025-05-07T20:31:26.8410266Z D=7168, 2025-05-07T20:31:26.8410460Z scale_ub=1200.0, 2025-05-07T20:31:26.8410689Z contiguous=False, 2025-05-07T20:31:26.8410913Z compiled=False, 2025-05-07T20:31:26.8411120Z ) 2025-05-07T20:31:27.2118732Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:27.2119980Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:27.2121369Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:27.2122803Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:27.2124220Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:27.2125603Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:27.2126902Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:27.2128267Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:27.2129685Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:27.2130917Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 
2025-05-07T20:31:27.2132137Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:27.2133343Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:27.2134380Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:27.2135864Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:27.2137078Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:27.2138355Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:27.2139470Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:27.2140607Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:27.2141798Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:27.2143148Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:27.2144208Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:27.2145118Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:27.2145858Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:27.2146879Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.1462142Z self = 2025-05-07T20:31:29.1463083Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:29.1463563Z 2025-05-07T20:31:29.1463692Z @given( 2025-05-07T20:31:29.1464061Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:29.1464566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:29.1465067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:29.1465591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:29.1466053Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:29.1466473Z ) 2025-05-07T20:31:29.1466997Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:29.1467677Z def test_silu_mul_quant( 2025-05-07T20:31:29.1468047Z self, 2025-05-07T20:31:29.1468352Z T: int, 2025-05-07T20:31:29.1468657Z D: int, 2025-05-07T20:31:29.1469008Z scale_ub: Optional[float], 2025-05-07T20:31:29.1469484Z contiguous: bool, 2025-05-07T20:31:29.1469868Z compiled: bool, 2025-05-07T20:31:29.1470238Z ) -> None: 2025-05-07T20:31:29.1470597Z torch.manual_seed(2025) 2025-05-07T20:31:29.1471010Z 2025-05-07T20:31:29.1471463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:29.1472041Z 2025-05-07T20:31:29.1472346Z x_sign = torch.sign(x) 2025-05-07T20:31:29.1472808Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:29.1473337Z x = x_sign * x_clamp 2025-05-07T20:31:29.1473744Z x0 = x[:, :D] 2025-05-07T20:31:29.1474073Z x1 = x[:, D:] 2025-05-07T20:31:29.1474401Z 2025-05-07T20:31:29.1474695Z if contiguous: 2025-05-07T20:31:29.1475065Z x0 = x0.contiguous() 2025-05-07T20:31:29.1475495Z x1 = x1.contiguous() 2025-05-07T20:31:29.1475900Z 2025-05-07T20:31:29.1476210Z if scale_ub is not None: 2025-05-07T20:31:29.1476654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:29.1477213Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:29.1477732Z ) 2025-05-07T20:31:29.1478041Z else: 2025-05-07T20:31:29.1478393Z scale_ub_tensor = None
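The ref_fn in this test computes SiLU(x0) * x1 in fp32 and then rowwise FP8 quantization via triton_quantize_fp8_row. A minimal eager sketch of that contract, assuming standard rowwise semantics (448.0 is the torch.float8_e4m3fn maximum) and a hypothetical helper name; the real triton_quantize_fp8_row is a Triton kernel implementation of the same idea:

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1, computed in fp32 as in ref_fn above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Rowwise scale: map each row's absmax (optionally capped at
        # scale_ub) onto the float8_e4m3fn maximum of 448.0.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / 448.0
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantization mirrors the test:
        # y is approximately y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale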
2025-05-07T20:31:29.1478825Z 2025-05-07T20:31:29.1479201Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.1479734Z op = silu_mul_quant 2025-05-07T20:31:29.1480146Z if compiled: 2025-05-07T20:31:29.1480545Z op = torch.compile(op) 2025-05-07T20:31:29.1481039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.1481500Z 2025-05-07T20:31:29.1481807Z > y_fp8, y_scale = fn() 2025-05-07T20:31:29.1482098Z 2025-05-07T20:31:29.1482262Z moe/activation_test.py:117: 2025-05-07T20:31:29.1482766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.1483796Z moe/activation_test.py:115: in fn 2025-05-07T20:31:29.1484458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.1485695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:29.1486929Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:29.1487854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:29.1489029Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:29.1490519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:29.1491466Z kernel = self.compile( 2025-05-07T20:31:29.1492409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:29.1493583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:29.1494287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.1494686Z 2025-05-07T20:31:29.1495028Z self = 2025-05-07T20:31:29.1496948Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:29.1499350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7e5be4d0>} 2025-05-07T20:31:29.1501821Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:29.1503600Z context = 2025-05-07T20:31:29.1504081Z 2025-05-07T20:31:29.1504355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:29.1505229Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:29.1506019Z module_map=module_map) 2025-05-07T20:31:29.1506607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.1507169Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:29.1507595Z E ^ 2025-05-07T20:31:29.1508372Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.1509153Z 2025-05-07T20:31:29.1509880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:29.1510785Z 2025-05-07T20:31:29.1510960Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:29.1511648Z self=, 2025-05-07T20:31:29.1512331Z T=1, 2025-05-07T20:31:29.1512616Z D=7168, 2025-05-07T20:31:29.1512928Z scale_ub=None, 2025-05-07T20:31:29.1513275Z contiguous=True, 2025-05-07T20:31:29.1513624Z compiled=True, 2025-05-07T20:31:29.1513957Z ) 2025-05-07T20:31:29.1514485Z self = 2025-05-07T20:31:29.1515286Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:29.1548768Z > y_fp8_ref,
2025-05-07T20:31:29.1548768Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:29.1549110Z 
2025-05-07T20:31:29.1549273Z moe/activation_test.py:126: 
2025-05-07T20:31:29.1549779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:29.1550360Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:29.1550907Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:29.1552331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:29.1553693Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:29.1554611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:29.1555797Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:29.1556983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:29.1558270Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:29.1559552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:29.1561007Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:29.1562320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:29.1563460Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:29.1564515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:29.1565433Z     fn()
2025-05-07T20:31:29.1566327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:29.1567353Z     self.fn.run(
2025-05-07T20:31:29.1568172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:29.1569109Z     kernel = self.compile(
2025-05-07T20:31:29.1570059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:29.1571224Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:29.1571905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:29.1572303Z 
2025-05-07T20:31:29.1572660Z self = 
2025-05-07T20:31:29.1574567Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:29.1577050Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cb1f90160>}
2025-05-07T20:31:29.1579464Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:29.1581408Z context = 
2025-05-07T20:31:29.1581910Z 
2025-05-07T20:31:29.1582180Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:29.1583044Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:29.1583821Z                            module_map=module_map)
2025-05-07T20:31:29.1584418Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:29.1585003Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:29.1585434Z E       ^
2025-05-07T20:31:29.1586197Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:29.1586993Z 
2025-05-07T20:31:29.1587740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:29.1588685Z 
2025-05-07T20:31:29.1588870Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:29.1589588Z     self=,
2025-05-07T20:31:29.1590590Z     T=4096,
2025-05-07T20:31:29.1590899Z     D=5120,
2025-05-07T20:31:29.1591212Z     scale_ub=None,
2025-05-07T20:31:29.1591554Z     contiguous=False,
2025-05-07T20:31:29.1591928Z     compiled=False,
2025-05-07T20:31:29.1592265Z )
2025-05-07T20:31:29.7126225Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:29.7128190Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last):
2025-05-07T20:31:29.7131074Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:29.7133922Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:29.7136473Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:29.7138894Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:29.7141351Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:29.7143783Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:29.7146274Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:29.7148502Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     generator.visit(fn.parse())
2025-05-07T20:31:29.7150704Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:29.7152888Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ret = super().visit(node)
2025-05-07T20:31:29.7154737Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
2025-05-07T20:31:29.7156507Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return visitor(node)
2025-05-07T20:31:29.7158689Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:29.7160997Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:29.7163001Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
2025-05-07T20:31:29.7164877Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     self.visit(item)
2025-05-07T20:31:29.7166987Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:29.7169423Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:29.7171306Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:29.7172917Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:29.7174486Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^
2025-05-07T20:31:29.7176295Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:30.2414635Z W0507 20:31:30.238000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:30.9307663Z W0507 20:31:30.927000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:30.9620872Z W0507 20:31:30.958000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
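Every failure above is the same architectural mismatch rather than a logic bug: Triton's fp8e4nv is the e4m3 FP8 format (torch.float8_e4m3fn), and Triton's NVIDIA backend accepts it only on compute capability 8.9 or newer, while the A10G GPU in a g5.4xlarge reports capability 8.6 and therefore only offers 'fp8e4b15' and 'fp8e5'. Below is a minimal sketch of a capability gate a test suite could use to skip these cases on such GPUs; the helper name and the skip wiring are illustrative assumptions, not code from fbgemm_gpu.

    # Illustrative guard (assumed helper, not from the FBGEMM sources): skip
    # FP8 e4m3 tests on GPUs older than SM 8.9, where Triton rejects fp8e4nv.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # The A10G on this runner reports (8, 6); Triton's fp8e4nv needs (8, 9)+.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class ActivationFp8Tests(unittest.TestCase):
        ...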
2025-05-07T20:31:34.2477605Z self = 
2025-05-07T20:31:34.2478317Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:34.2491054Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:34.2491336Z moe/activation_test.py:117: 
2025-05-07T20:31:34.2506301Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.2506658Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.2506929Z E       ^
2025-05-07T20:31:34.2507386Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.2508268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:34.2508887Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:34.2509302Z     self=,
2025-05-07T20:31:34.2509694Z     T=4096,
2025-05-07T20:31:34.2509882Z     D=7168,
2025-05-07T20:31:34.2510080Z     scale_ub=None,
2025-05-07T20:31:34.2510292Z     contiguous=False,
2025-05-07T20:31:34.2510521Z     compiled=False,
2025-05-07T20:31:34.2510730Z )
2025-05-07T20:31:34.2511042Z self = 
2025-05-07T20:31:34.2511543Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:34.2523483Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:34.2523753Z moe/activation_test.py:117: 
2025-05-07T20:31:34.2537127Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.2537480Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.2537745Z E       ^
2025-05-07T20:31:34.2538208Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.2539066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:34.2539691Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:34.2540252Z     self=,
2025-05-07T20:31:34.2540833Z     T=128,
2025-05-07T20:31:34.2541098Z     D=7168,
2025-05-07T20:31:34.2541366Z     scale_ub=None,
2025-05-07T20:31:34.2541657Z     contiguous=False,
2025-05-07T20:31:34.2541975Z     compiled=True,
2025-05-07T20:31:34.2542257Z )
2025-05-07T20:31:34.3221788Z self = 
2025-05-07T20:31:34.3223159Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:31:34.3246407Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:34.3246725Z moe/activation_test.py:126: 
2025-05-07T20:31:34.3267308Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.3267675Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:34.3267951Z E       ^
2025-05-07T20:31:34.3268414Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.3269279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:34.3269903Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:34.3270314Z     self=,
2025-05-07T20:31:34.3270719Z     T=128,
2025-05-07T20:31:34.3270924Z     D=7168,
2025-05-07T20:31:34.3271131Z     scale_ub=None,
2025-05-07T20:31:34.3271353Z     contiguous=False,
2025-05-07T20:31:34.3271686Z     compiled=False,
2025-05-07T20:31:34.3271903Z )
2025-05-07T20:31:34.5376286Z self = 
2025-05-07T20:31:34.5377097Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:34.5389384Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:34.5389662Z moe/activation_test.py:117: 
2025-05-07T20:31:34.5403793Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.5404147Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.5404411Z E       ^
2025-05-07T20:31:34.5404879Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.5405736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:34.5406363Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:34.5406776Z     self=,
2025-05-07T20:31:34.5407181Z     T=4096,
2025-05-07T20:31:34.5407368Z     D=5120,
2025-05-07T20:31:34.5407571Z     scale_ub=1200.0,
2025-05-07T20:31:34.5407799Z     contiguous=True,
2025-05-07T20:31:34.5408022Z     compiled=False,
2025-05-07T20:31:34.5408237Z )
2025-05-07T20:31:34.5408564Z self = 
2025-05-07T20:31:34.5409053Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:34.5421095Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:34.5421369Z moe/activation_test.py:117: 
2025-05-07T20:31:34.5434772Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.5435126Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.5435383Z E       ^
2025-05-07T20:31:34.5435846Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.5436877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
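For reference, the ref_fn path above quantizes the SiLU-mul output row by row. Here is a minimal pure-PyTorch sketch of that row-wise FP8 quantization, inferred from how the test consumes the result (y_fp8.to(torch.float32) * y_scale[:, None]) rather than taken from triton_quantize_fp8_row itself; fbgemm_gpu's kernel may differ in details such as epsilon handling and the exact scale_ub semantics.

    # Sketch under stated assumptions: per-row scales sized so each row fits
    # the e4m3 range, with an optional upper bound on the row maximum.
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX               # per-row dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale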
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:34.5436385Z 2025-05-07T20:31:34.5436877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:34.5437396Z 2025-05-07T20:31:34.5437502Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:34.5437916Z self=, 2025-05-07T20:31:34.5438318Z T=1, 2025-05-07T20:31:34.5438508Z D=5120, 2025-05-07T20:31:34.5438702Z scale_ub=None, 2025-05-07T20:31:34.5438927Z contiguous=True, 2025-05-07T20:31:34.5439155Z compiled=True, 2025-05-07T20:31:34.5439360Z ) 2025-05-07T20:31:35.0182215Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:35.0183584Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:35.0184971Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:35.0186405Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:35.0187787Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:35.0189181Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.0190793Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:35.0192178Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.0193578Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:35.0194817Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:35.0196035Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:35.0197242Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:35.0198273Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:35.0199281Z W0507 20:31:35.014000 86845 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:35.0200493Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:35.0202137Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:35.0203253Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:35.0204281Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:35.0205451Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:35.0206800Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:35.0207869Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.0208781Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.0209514Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:35.0210529Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
[... identical identify_mutated_tensors warning + CompilationError traceback repeated 3 more times for frame [1/4] (20:31:35.178, 20:31:35.628, 20:31:35.657) ...]
2025-05-07T20:31:35.9690748Z self = <...>
2025-05-07T20:31:35.9691488Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:35.9691866Z 
2025-05-07T20:31:35.9692022Z @given(
2025-05-07T20:31:35.9692351Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:35.9692790Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:35.9693256Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:35.9693707Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:35.9694048Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:31:35.9694333Z )
2025-05-07T20:31:35.9702246Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:35.9702834Z def test_silu_mul_quant(
2025-05-07T20:31:35.9703093Z     self,
2025-05-07T20:31:35.9703292Z     T: int,
2025-05-07T20:31:35.9703501Z     D: int,
2025-05-07T20:31:35.9703732Z     scale_ub: Optional[float],
2025-05-07T20:31:35.9704007Z     contiguous: bool,
2025-05-07T20:31:35.9704260Z     compiled: bool,
2025-05-07T20:31:35.9704495Z ) -> None:
2025-05-07T20:31:35.9704724Z     torch.manual_seed(2025)
2025-05-07T20:31:35.9704973Z 
2025-05-07T20:31:35.9705257Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:35.9705606Z 
2025-05-07T20:31:35.9705800Z     x_sign = torch.sign(x)
2025-05-07T20:31:35.9706117Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:35.9706433Z     x = x_sign * x_clamp
2025-05-07T20:31:35.9706684Z     x0 = x[:, :D]
2025-05-07T20:31:35.9706911Z     x1 = x[:, D:]
2025-05-07T20:31:35.9707127Z 
2025-05-07T20:31:35.9707318Z     if contiguous:
2025-05-07T20:31:35.9707559Z         x0 = x0.contiguous()
2025-05-07T20:31:35.9707824Z         x1 = x1.contiguous()
2025-05-07T20:31:35.9708063Z 
2025-05-07T20:31:35.9708263Z     if scale_ub is not None:
2025-05-07T20:31:35.9708542Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:31:35.9708875Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:35.9709189Z         )
2025-05-07T20:31:35.9709389Z     else:
2025-05-07T20:31:35.9709602Z         scale_ub_tensor = None
2025-05-07T20:31:35.9709863Z 
2025-05-07T20:31:35.9710105Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:35.9710425Z         op = silu_mul_quant
2025-05-07T20:31:35.9710680Z         if compiled:
2025-05-07T20:31:35.9710937Z             op = torch.compile(op)
2025-05-07T20:31:35.9711253Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:35.9711529Z 
2025-05-07T20:31:35.9711735Z     y_fp8, y_scale = fn()
2025-05-07T20:31:35.9712027Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:35.9712314Z 
2025-05-07T20:31:35.9712559Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:35.9712902Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:35.9713193Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:35.9713511Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:35.9713872Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:35.9714187Z 
2025-05-07T20:31:35.9714392Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:35.9714595Z 
2025-05-07T20:31:35.9714699Z moe/activation_test.py:126: 
2025-05-07T20:31:35.9715353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:35.9715819Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:35.9716163Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:35.9716958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:35.9717716Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:35.9718266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:35.9718957Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:35.9719647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:35.9720365Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:35.9721130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:35.9721886Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:35.9722616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:35.9723264Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:35.9723906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:35.9724438Z     fn()
2025-05-07T20:31:35.9724960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:35.9725537Z     self.fn.run(
2025-05-07T20:31:35.9726011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:35.9726559Z     kernel = self.compile(
2025-05-07T20:31:35.9727101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:35.9727757Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:35.9728156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:35.9728408Z 
2025-05-07T20:31:35.9728630Z self = <...>
2025-05-07T20:31:35.9729707Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:35.9731102Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f1c7e5bea70>}
2025-05-07T20:31:35.9732445Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:31:35.9733461Z context = <...>
2025-05-07T20:31:35.9733754Z 
2025-05-07T20:31:35.9733924Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:35.9734449Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:35.9734914Z                            module_map=module_map)
2025-05-07T20:31:35.9735281Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:35.9735641Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:35.9735909Z E       ^
2025-05-07T20:31:35.9736369Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:35.9736918Z 
2025-05-07T20:31:35.9737410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
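The repeated failure above is an architecture mismatch rather than a logic bug: Triton's fp8e4nv maps to torch.float8_e4m3fn, which the NVIDIA backend only compiles natively on compute capability 8.9 and newer (Ada/Hopper); on older parts such as the A10G (sm_86), only the 'fp8e4b15' and 'fp8e5' encodings are available, hence the ValueError. A minimal sketch of a capability guard a test like this could use (the helper name and skip message are illustrative, not taken from the test file):

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ on NVIDIA GPUs; earlier
    # architectures raise "type fp8e4nv not supported in this architecture".
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Example usage on the test above:
# @unittest.skipUnless(_supports_fp8e4nv(), "FP8 e4m3 needs SM 8.9+ (Ada/Hopper)")
# def test_silu_mul_quant(self, ...) -> None: ...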
2025-05-07T20:31:35.9738032Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:35.9738445Z     self=<...>,
2025-05-07T20:31:35.9738841Z     T=2048,
2025-05-07T20:31:35.9739035Z     D=5120,
2025-05-07T20:31:35.9739232Z     scale_ub=None,
2025-05-07T20:31:35.9739444Z     contiguous=True,
2025-05-07T20:31:35.9739672Z     compiled=True,
2025-05-07T20:31:35.9739969Z )
[... identical identify_mutated_tensors warning + CompilationError traceback repeated 4 times for frame [1/5] (20:31:36.412, 20:31:36.575, 20:31:37.019, 20:31:37.049) ...]
2025-05-07T20:31:37.5098498Z self = <...>
2025-05-07T20:31:37.5099270Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source listing, failure traceback, and CompilationError from _kernel_quantize_fp8_row identical to the T=1 example above, apart from object addresses ...]
2025-05-07T20:31:37.5137153Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:37.5137646Z     self=<...>,
2025-05-07T20:31:37.5138050Z     T=128,
2025-05-07T20:31:37.5138232Z     D=5120,
2025-05-07T20:31:37.5138431Z     scale_ub=None,
2025-05-07T20:31:37.5138655Z     contiguous=True,
2025-05-07T20:31:37.5138874Z     compiled=True,
2025-05-07T20:31:37.5139087Z )
[... identical identify_mutated_tensors warning + CompilationError traceback repeated for frame [1/6] (20:31:37.979, 20:31:38.143, 20:31:38.592) ...]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.6259458Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.6260636Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:38.6261956Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.6264182Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.6265565Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.6266937Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.6268234Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.6269598Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.6271007Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.6272235Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:38.6273433Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.6274625Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:38.6275652Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:38.6276664Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:38.6277872Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.6279134Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.6280234Z W0507 
20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:38.6281260Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:38.6282421Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.6283756Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.6284813Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.6285712Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.6286448Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:38.6287601Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.0383924Z self = 2025-05-07T20:31:39.0384531Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:39.0384805Z 2025-05-07T20:31:39.0384885Z @given( 2025-05-07T20:31:39.0385132Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.0385447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.0385766Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.0386105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.0386440Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.0386722Z ) 2025-05-07T20:31:39.0387079Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.0387559Z def test_silu_mul_quant( 2025-05-07T20:31:39.0387812Z self, 2025-05-07T20:31:39.0388026Z T: int, 2025-05-07T20:31:39.0388232Z D: int, 2025-05-07T20:31:39.0388455Z scale_ub: Optional[float], 2025-05-07T20:31:39.0388736Z contiguous: bool, 2025-05-07T20:31:39.0388983Z compiled: bool, 2025-05-07T20:31:39.0389213Z ) -> None: 2025-05-07T20:31:39.0389439Z torch.manual_seed(2025) 2025-05-07T20:31:39.0389688Z 2025-05-07T20:31:39.0390249Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.0390604Z 2025-05-07T20:31:39.0390808Z x_sign = torch.sign(x) 2025-05-07T20:31:39.0391100Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.0391428Z x = x_sign * x_clamp 2025-05-07T20:31:39.0391688Z x0 = x[:, :D] 2025-05-07T20:31:39.0391914Z x1 = x[:, D:] 2025-05-07T20:31:39.0392139Z 2025-05-07T20:31:39.0392326Z if contiguous: 2025-05-07T20:31:39.0392568Z x0 = x0.contiguous() 2025-05-07T20:31:39.0392842Z x1 = x1.contiguous() 2025-05-07T20:31:39.0393082Z 2025-05-07T20:31:39.0393283Z if scale_ub is not None: 2025-05-07T20:31:39.0393565Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.0393900Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.0394244Z ) 2025-05-07T20:31:39.0394467Z else: 2025-05-07T20:31:39.0394679Z scale_ub_tensor = None 
2025-05-07T20:31:39.0394935Z 2025-05-07T20:31:39.0395175Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.0395491Z op = silu_mul_quant 2025-05-07T20:31:39.0395752Z if compiled: 2025-05-07T20:31:39.0396011Z op = torch.compile(op) 2025-05-07T20:31:39.0396311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.0396596Z 2025-05-07T20:31:39.0396797Z y_fp8, y_scale = fn() 2025-05-07T20:31:39.0397097Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:39.0397390Z 2025-05-07T20:31:39.0397635Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.0397979Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:39.0398275Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:39.0398596Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:39.0398958Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:39.0399268Z 2025-05-07T20:31:39.0399473Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:39.0399667Z 2025-05-07T20:31:39.0399778Z moe/activation_test.py:126: 2025-05-07T20:31:39.0400085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.0400418Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:39.0400755Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:39.0402072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:39.0402824Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:39.0403374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.0404056Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.0404742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:39.0405457Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:39.0406212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:39.0406962Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:39.0407701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:39.0408337Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:39.0408937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:39.0409466Z fn() 2025-05-07T20:31:39.0409971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:39.0410560Z self.fn.run( 2025-05-07T20:31:39.0411029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.0411562Z kernel = self.compile( 2025-05-07T20:31:39.0412097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.0412758Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.0413165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.0413394Z 2025-05-07T20:31:39.0413603Z self = 2025-05-07T20:31:39.0414684Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.0416079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c564dcb80>} 2025-05-07T20:31:39.0417409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.0418436Z context = 2025-05-07T20:31:39.0418724Z 2025-05-07T20:31:39.0418898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.0419421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.0420034Z module_map=module_map) 2025-05-07T20:31:39.0420407Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.0420759Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:39.0421031Z E ^ 2025-05-07T20:31:39.0421496Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.0421939Z 2025-05-07T20:31:39.0422351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.0422874Z 2025-05-07T20:31:39.0423077Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.0423495Z self=, 2025-05-07T20:31:39.0424013Z T=4096, 2025-05-07T20:31:39.0424229Z D=5120, 2025-05-07T20:31:39.0424459Z scale_ub=None, 2025-05-07T20:31:39.0424684Z contiguous=True, 2025-05-07T20:31:39.0424907Z compiled=True, 2025-05-07T20:31:39.0425125Z ) 2025-05-07T20:31:39.5157767Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:39.5158841Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:39.5160175Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:39.5161625Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:39.5163001Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:39.5164381Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.5165680Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:39.5167050Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.5168463Z W0507 20:31:39.512000 86845 
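Every failure in this run is the same Triton error: the fp8e4nv (e4m3) dtype is only exposed on NVIDIA GPUs with compute capability 8.9 or newer, and the GPU on this runner only offers fp8e4b15 and fp8e5. A minimal, hypothetical guard (not part of the test file above) that would skip these cases on older devices:

import unittest
import torch

def has_fp8e4nv_support() -> bool:
    # Triton's fp8e4nv requires an SM 8.9+ (Ada/Hopper-class) CUDA device.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(has_fp8e4nv_support(), "fp8e4nv unsupported on this GPU")
def test_silu_mul_quant_fp8_gated() -> None:
    ...  # the fp8 test body would go here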
2025-05-07T20:31:40.5624069Z self = 
2025-05-07T20:31:40.5624731Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:40.5660325Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:40.5660689Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:40.5660955Z E       ^
2025-05-07T20:31:40.5661414Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:40.5661864Z 
2025-05-07T20:31:40.5662279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:40.5662788Z 
2025-05-07T20:31:40.5662899Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:40.5663314Z     self=,
2025-05-07T20:31:40.5663812Z     T=16384,
2025-05-07T20:31:40.5664013Z     D=5120,
2025-05-07T20:31:40.5664211Z     scale_ub=None,
2025-05-07T20:31:40.5664499Z     contiguous=True,
2025-05-07T20:31:40.5664732Z     compiled=True,
2025-05-07T20:31:40.5664940Z )
2025-05-07T20:31:40.6076844Z W0507 20:31:40.606000 86845 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:40.6078106Z W0507 20:31:40.606000 86845 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:40.6079471Z W0507 20:31:40.606000 86845 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:40.6080457Z W0507 20:31:40.606000 86845 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:40.6081584Z W0507 20:31:40.606000 86845 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
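The recompile-limit warning above is distinct from the fp8 failures: each hypothesis example changes T and the layout of the sliced x0/x1 inputs (contiguous vs. strided), so torch.compile's guards fail and a fresh graph is compiled until config.recompile_limit (8) is hit, after which dynamo falls back to eager. A sketch of two possible mitigations, assuming only the config knob named in the warning:

import torch
import torch._dynamo as dynamo

# Option 1: allow more recompiles across the parameter sweep.
dynamo.config.recompile_limit = 64  # default is 8 per the warning

# Option 2: drop cached graphs before each new example instead.
dynamo.reset()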
2025-05-07T20:31:40.7102144Z self = 2025-05-07T20:31:40.7102907Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:40.7103294Z 2025-05-07T20:31:40.7103397Z @given( 2025-05-07T20:31:40.7103642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.7103965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.7104275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.7104617Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.7104957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.7105248Z ) 2025-05-07T20:31:40.7105636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.7106082Z def test_silu_mul_quant( 2025-05-07T20:31:40.7106346Z self, 2025-05-07T20:31:40.7106547Z T: int, 2025-05-07T20:31:40.7106775Z D: int, 2025-05-07T20:31:40.7107010Z scale_ub: Optional[float], 2025-05-07T20:31:40.7107288Z contiguous: bool, 2025-05-07T20:31:40.7107531Z compiled: bool, 2025-05-07T20:31:40.7121225Z ) -> None: 2025-05-07T20:31:40.7121570Z torch.manual_seed(2025) 2025-05-07T20:31:40.7121931Z 2025-05-07T20:31:40.7122304Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.7122693Z 2025-05-07T20:31:40.7122926Z x_sign = torch.sign(x) 2025-05-07T20:31:40.7123235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:40.7123564Z x = x_sign * x_clamp 2025-05-07T20:31:40.7123818Z x0 = x[:, :D] 2025-05-07T20:31:40.7124053Z x1 = x[:, D:] 2025-05-07T20:31:40.7124320Z 2025-05-07T20:31:40.7124526Z if contiguous: 2025-05-07T20:31:40.7124812Z x0 = x0.contiguous() 2025-05-07T20:31:40.7125138Z x1 = x1.contiguous() 2025-05-07T20:31:40.7125399Z 2025-05-07T20:31:40.7125601Z if scale_ub is not None: 2025-05-07T20:31:40.7125892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:40.7126245Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:40.7126563Z ) 2025-05-07T20:31:40.7126804Z else: 2025-05-07T20:31:40.7127036Z scale_ub_tensor = None 2025-05-07T20:31:40.7127295Z 2025-05-07T20:31:40.7127548Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:40.7127881Z op = silu_mul_quant 2025-05-07T20:31:40.7128153Z if compiled: 2025-05-07T20:31:40.7128414Z op = torch.compile(op) 2025-05-07T20:31:40.7128729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:40.7129387Z 2025-05-07T20:31:40.7129590Z y_fp8, y_scale = fn() 2025-05-07T20:31:40.7129898Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:40.7130365Z 2025-05-07T20:31:40.7130617Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:40.7130970Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:40.7131284Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:40.7131612Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:40.7131988Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:40.7132309Z 2025-05-07T20:31:40.7132519Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:40.7132726Z 2025-05-07T20:31:40.7132838Z moe/activation_test.py:126: 2025-05-07T20:31:40.7133152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.7133501Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:40.7133835Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:40.7134648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:40.7135415Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:40.7135973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:40.7136728Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:40.7137539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:40.7142906Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:40.7143725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:40.7165249Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:40.7166012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:40.7166655Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:40.7167244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:40.7167756Z fn() 2025-05-07T20:31:40.7168260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:40.7168835Z self.fn.run( 2025-05-07T20:31:40.7169302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:40.7169836Z kernel = self.compile( 2025-05-07T20:31:40.7170370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:40.7171019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.7171419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.7171644Z 2025-05-07T20:31:40.7171854Z self = 2025-05-07T20:31:40.7172929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:40.7174295Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c5663a3b0>} 2025-05-07T20:31:40.7175621Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:40.7176790Z context = 2025-05-07T20:31:40.7177077Z 2025-05-07T20:31:40.7177328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:40.7177850Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.7178313Z module_map=module_map) 2025-05-07T20:31:40.7178676Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.7179035Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:40.7179301Z E ^ 2025-05-07T20:31:40.7179760Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.7180262Z 2025-05-07T20:31:40.7180680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:40.7181195Z 2025-05-07T20:31:40.7181310Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.7181716Z self=, 2025-05-07T20:31:40.7182123Z T=1, 2025-05-07T20:31:40.7182309Z D=5120, 2025-05-07T20:31:40.7182509Z scale_ub=1200.0, 2025-05-07T20:31:40.7182728Z contiguous=True, 2025-05-07T20:31:40.7182947Z compiled=True, 2025-05-07T20:31:40.7183150Z ) 2025-05-07T20:31:41.0648002Z self = 2025-05-07T20:31:41.0648742Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.0649111Z 2025-05-07T20:31:41.0649200Z @given( 2025-05-07T20:31:41.0649442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.0649764Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.0650078Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.0650415Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.0650782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.0651076Z ) 2025-05-07T20:31:41.0651449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.0651899Z def test_silu_mul_quant( 2025-05-07T20:31:41.0652144Z self, 2025-05-07T20:31:41.0652350Z T: int, 2025-05-07T20:31:41.0652556Z D: int, 2025-05-07T20:31:41.0652781Z scale_ub: Optional[float], 2025-05-07T20:31:41.0653065Z contiguous: bool, 2025-05-07T20:31:41.0653314Z compiled: bool, 2025-05-07T20:31:41.0653546Z ) -> None: 2025-05-07T20:31:41.0653776Z torch.manual_seed(2025) 2025-05-07T20:31:41.0654035Z 2025-05-07T20:31:41.0654319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.0654694Z 2025-05-07T20:31:41.0654937Z x_sign = torch.sign(x) 2025-05-07T20:31:41.0655233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.0655553Z x = x_sign * x_clamp 2025-05-07T20:31:41.0655816Z x0 = x[:, :D] 2025-05-07T20:31:41.0656039Z x1 = x[:, D:] 2025-05-07T20:31:41.0656262Z 2025-05-07T20:31:41.0656469Z if contiguous: 2025-05-07T20:31:41.0656717Z x0 = x0.contiguous() 2025-05-07T20:31:41.0656985Z x1 = x1.contiguous() 2025-05-07T20:31:41.0657223Z 2025-05-07T20:31:41.0657421Z if scale_ub is not None: 2025-05-07T20:31:41.0657701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.0658037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.0658349Z ) 2025-05-07T20:31:41.0658550Z else: 2025-05-07T20:31:41.0658763Z scale_ub_tensor = None 2025-05-07T20:31:41.0659020Z 2025-05-07T20:31:41.0659261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.0659573Z op = silu_mul_quant 2025-05-07T20:31:41.0659940Z if compiled: 2025-05-07T20:31:41.0660201Z op = torch.compile(op) 2025-05-07T20:31:41.0660873Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0661143Z 2025-05-07T20:31:41.0661471Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.0661642Z 2025-05-07T20:31:41.0661755Z moe/activation_test.py:117: 2025-05-07T20:31:41.0662054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.0662393Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.0662682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0663242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.0663815Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f1c5524feb0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

[@given parameters and test body identical to the example above, up to and including fn(); here fn() succeeds and the failure moves to the reference path:]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[autotuner and compiler frames identical to the first traceback above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
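Every failure in this run is the same root cause: Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) cannot be lowered for this GPU's architecture, and per the ValueError only 'fp8e4b15' and 'fp8e5' are available here. As of the Triton version in this environment, fp8e4nv is only enabled on compute capability 8.9 and newer. One way to keep such a job green on older runners is a capability gate in the test; the helper below is a sketch under that assumption (the function name and its placement are not part of the original test):

import pytest
import torch

def require_fp8e4nv() -> None:
    # fp8e4nv (e4m3) compiles in Triton only on SM 8.9+ GPUs; skip elsewhere.
    if not torch.cuda.is_available():
        pytest.skip("CUDA required")
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (8, 9):
        pytest.skip("fp8e4nv (torch.float8_e4m3fn) not supported on this GPU")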
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[test body identical to the first example; fails at moe/activation_test.py:117 in fn() -> activation.py:80 in silu_mul_quant -> _fbgemm_silu_mul_quant; same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
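Note that the reference path fails for the same reason as the op under test: ref_fn quantizes through triton_quantize_fp8_row, which compiles an fp8e4nv Triton kernel of its own, so even the "pure PyTorch" reference cannot run on this machine. A torch-only rowwise quantization would sidestep Triton entirely; the sketch below assumes the usual rowwise scheme (per-row scale = max_abs / fp8_max, dequantization by multiplying with the scale, scale_ub clamping the row max) and is an illustration, not FBGEMM's implementation:

import torch

def quantize_fp8_row_torch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Rowwise scale: max_abs(row) / fp8_max, optionally clamped by scale_ub.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=1, keepdim=True).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = (row_max / fp8_max).clamp(min=1e-12)
    y_fp8 = (y.float() / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)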
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[identical failure at moe/activation_test.py:117 in fn(), via torch/_dynamo/eval_frame.py:678 -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[identical failure at moe/activation_test.py:117 in fn() -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
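The parameters hypothesis draws (T, D, scale_ub, contiguous, compiled) never change the outcome, because the exception is raised while Triton lowers the kernel AST to TTIR, before any launch grid or tensor is consulted. A minimal repro independent of FBGEMM would be any jitted kernel that materializes the dtype; the kernel below is a hypothetical sketch, assuming this environment's triton.language exposes float8e4nv:

import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    # The .to(tl.float8e4nv) below is what trips the ValueError on SM < 8.9.
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)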
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[identical failure at moe/activation_test.py:117 in fn() -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[identical failure at moe/activation_test.py:117 in fn() -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[identical failure at moe/activation_test.py:117 in fn(), via torch/_dynamo/eval_frame.py:678 -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[identical failure at moe/activation_test.py:117 in fn(), via torch/_dynamo/eval_frame.py:678 -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[as in the earlier scale_ub=None, compiled=True example: fn() succeeds, then ref_fn() fails at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[identical failure at moe/activation_test.py:117 in fn(), via torch/_dynamo/eval_frame.py:678 -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
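The error message itself lists 'fp8e5' (e5m2) as compilable on this architecture, so another option is to fall back to torch.float8_e5m2 on pre-8.9 GPUs. Whether the FBGEMM GenAI kernels accept e5m2 inputs is not established by this log, so the selector below is only a sketch of the dtype choice:

import torch

def pick_fp8_dtype() -> torch.dtype:
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn   # Triton 'fp8e4nv'
    return torch.float8_e5m2         # Triton 'fp8e5', listed as supported here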
2025-05-07T20:31:42.2637316Z op = torch.compile(op) 2025-05-07T20:31:42.2637627Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.2637910Z 2025-05-07T20:31:42.2638105Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.2638280Z 2025-05-07T20:31:42.2638385Z moe/activation_test.py:117: 2025-05-07T20:31:42.2638692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.2639386Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.2639808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.2640384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.2640953Z return fn(*args, **kwargs) 2025-05-07T20:31:42.2641616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.2642314Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.2642862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.2643547Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.2644219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.2644769Z kernel = self.compile( 2025-05-07T20:31:42.2645375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.2646040Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.2646448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.2646683Z 2025-05-07T20:31:42.2646904Z self = 2025-05-07T20:31:42.2647986Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.2649384Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55076170>} 2025-05-07T20:31:42.2650741Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.2651771Z context = 2025-05-07T20:31:42.2652063Z 2025-05-07T20:31:42.2652241Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.2652766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.2653242Z module_map=module_map) 2025-05-07T20:31:42.2653618Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.2653980Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.2654243Z E ^ 2025-05-07T20:31:42.2654712Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.2655163Z 2025-05-07T20:31:42.2655594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.2656116Z 2025-05-07T20:31:42.2656230Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.2656640Z self=, 2025-05-07T20:31:42.2657045Z T=1, 2025-05-07T20:31:42.2657237Z D=5120, 2025-05-07T20:31:42.2657431Z scale_ub=1200.0, 2025-05-07T20:31:42.2657681Z contiguous=False, 2025-05-07T20:31:42.2657915Z compiled=False, 2025-05-07T20:31:42.2658130Z ) 2025-05-07T20:31:42.2658446Z self = 2025-05-07T20:31:42.2658939Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:42.2659208Z 2025-05-07T20:31:42.2659296Z @given( 2025-05-07T20:31:42.2659525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.2659944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.2660350Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.2660751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.2661093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.2661385Z ) 2025-05-07T20:31:42.2661744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.2662181Z def test_silu_mul_quant( 2025-05-07T20:31:42.2662429Z self, 2025-05-07T20:31:42.2662630Z T: int, 2025-05-07T20:31:42.2662828Z D: int, 2025-05-07T20:31:42.2663056Z scale_ub: Optional[float], 2025-05-07T20:31:42.2663334Z contiguous: bool, 2025-05-07T20:31:42.2663574Z compiled: bool, 2025-05-07T20:31:42.2663809Z ) -> None: 2025-05-07T20:31:42.2664035Z torch.manual_seed(2025) 2025-05-07T20:31:42.2664283Z 2025-05-07T20:31:42.2664564Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.2664922Z 2025-05-07T20:31:42.2665124Z x_sign = torch.sign(x) 2025-05-07T20:31:42.2665471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.2665788Z x = x_sign * x_clamp 2025-05-07T20:31:42.2666035Z x0 = x[:, :D] 2025-05-07T20:31:42.2666253Z x1 = x[:, D:] 2025-05-07T20:31:42.2666469Z 2025-05-07T20:31:42.2666667Z if contiguous: 2025-05-07T20:31:42.2666901Z x0 = x0.contiguous() 2025-05-07T20:31:42.2667165Z x1 = x1.contiguous() 2025-05-07T20:31:42.2667409Z 2025-05-07T20:31:42.2667602Z if scale_ub is not None: 2025-05-07T20:31:42.2667881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.2668218Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.2668522Z ) 2025-05-07T20:31:42.2668721Z else: 2025-05-07T20:31:42.2668942Z scale_ub_tensor = None 2025-05-07T20:31:42.2669195Z 2025-05-07T20:31:42.2669432Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.2669753Z op = silu_mul_quant 2025-05-07T20:31:42.2670012Z if compiled: 2025-05-07T20:31:42.2670267Z op = torch.compile(op) 2025-05-07T20:31:42.2670569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.2670841Z 2025-05-07T20:31:42.2671041Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.2671214Z 2025-05-07T20:31:42.2671316Z moe/activation_test.py:117: 2025-05-07T20:31:42.2671619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.2671950Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.2672236Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.2672923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.2673613Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.2674158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.2674862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.2675578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.2676108Z kernel = self.compile( 2025-05-07T20:31:42.2676652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.2677314Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.2677706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.2677943Z 2025-05-07T20:31:42.2678150Z self = 2025-05-07T20:31:42.2679228Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.2680759Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55075e10>} 2025-05-07T20:31:42.2682115Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.2683130Z context = 2025-05-07T20:31:42.2683427Z 2025-05-07T20:31:42.2683595Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.2684125Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.2684594Z module_map=module_map) 2025-05-07T20:31:42.2684969Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.2685330Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.2685607Z E ^ 2025-05-07T20:31:42.2686075Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.2686726Z 2025-05-07T20:31:42.2687146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.2687663Z 2025-05-07T20:31:42.2687770Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.2688188Z self=, 2025-05-07T20:31:42.2688590Z T=16384, 2025-05-07T20:31:42.2688789Z D=5120, 2025-05-07T20:31:42.2688992Z scale_ub=1200.0, 2025-05-07T20:31:42.2689218Z contiguous=False, 2025-05-07T20:31:42.2689448Z compiled=True, 2025-05-07T20:31:42.2689675Z ) 2025-05-07T20:31:42.3691836Z self = 2025-05-07T20:31:42.3692957Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:42.3693574Z 2025-05-07T20:31:42.3693736Z @given( 2025-05-07T20:31:42.3694205Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.3694684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.3703433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.3703846Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.3704209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.3704507Z ) 2025-05-07T20:31:42.3704879Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.3705336Z def test_silu_mul_quant( 2025-05-07T20:31:42.3705589Z self, 2025-05-07T20:31:42.3705803Z T: int, 2025-05-07T20:31:42.3706019Z D: int, 2025-05-07T20:31:42.3706250Z scale_ub: Optional[float], 2025-05-07T20:31:42.3706555Z contiguous: bool, 2025-05-07T20:31:42.3706811Z compiled: bool, 2025-05-07T20:31:42.3707092Z ) -> None: 2025-05-07T20:31:42.3707331Z torch.manual_seed(2025) 2025-05-07T20:31:42.3707586Z 2025-05-07T20:31:42.3707877Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.3708235Z 2025-05-07T20:31:42.3708438Z x_sign = torch.sign(x) 2025-05-07T20:31:42.3708740Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.3709054Z x = x_sign * x_clamp 2025-05-07T20:31:42.3709312Z x0 = x[:, :D] 2025-05-07T20:31:42.3709536Z x1 = x[:, D:] 2025-05-07T20:31:42.3709760Z 2025-05-07T20:31:42.3709959Z if contiguous: 2025-05-07T20:31:42.3710199Z x0 = x0.contiguous() 2025-05-07T20:31:42.3710474Z x1 = x1.contiguous() 2025-05-07T20:31:42.3710727Z 2025-05-07T20:31:42.3710924Z if scale_ub is not None: 2025-05-07T20:31:42.3711210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.3711928Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.3712372Z ) 2025-05-07T20:31:42.3712587Z else: 2025-05-07T20:31:42.3712818Z scale_ub_tensor = None 2025-05-07T20:31:42.3713078Z 2025-05-07T20:31:42.3713326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.3713659Z op = silu_mul_quant 2025-05-07T20:31:42.3713918Z if compiled: 2025-05-07T20:31:42.3714178Z op = torch.compile(op) 2025-05-07T20:31:42.3714485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.3714770Z 2025-05-07T20:31:42.3714972Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.3715152Z 2025-05-07T20:31:42.3715259Z moe/activation_test.py:117: 2025-05-07T20:31:42.3715566Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.3715908Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.3716208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.3716785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.3717350Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.3718021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.3718718Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.3719270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.3719955Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.3720635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.3721181Z kernel = self.compile( 2025-05-07T20:31:42.3721728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.3722401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.3722820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.3723054Z 2025-05-07T20:31:42.3723277Z self = 2025-05-07T20:31:42.3724357Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.3725764Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c550743a0>} 2025-05-07T20:31:42.3727119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.3728164Z context = 2025-05-07T20:31:42.3728456Z 2025-05-07T20:31:42.3728636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.3729167Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.3729646Z module_map=module_map) 2025-05-07T20:31:42.3730033Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.3730392Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.3730663Z E ^ 2025-05-07T20:31:42.3731142Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.3731595Z 2025-05-07T20:31:42.3732030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.3732641Z 2025-05-07T20:31:42.3732752Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.3733242Z self=, 2025-05-07T20:31:42.3733656Z T=2048, 2025-05-07T20:31:42.3733851Z D=7168, 2025-05-07T20:31:42.3734051Z scale_ub=1200.0, 2025-05-07T20:31:42.3734272Z contiguous=False, 2025-05-07T20:31:42.3734505Z compiled=True, 2025-05-07T20:31:42.3734717Z ) 2025-05-07T20:31:42.3735035Z self = 2025-05-07T20:31:42.3735536Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:42.3735818Z 2025-05-07T20:31:42.3735902Z @given( 2025-05-07T20:31:42.3736145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.3736457Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.3736774Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.3737120Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.3737451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.3737751Z ) 2025-05-07T20:31:42.3738109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.3738546Z def test_silu_mul_quant( 2025-05-07T20:31:42.3738793Z self, 2025-05-07T20:31:42.3739001Z T: int, 2025-05-07T20:31:42.3739201Z D: int, 2025-05-07T20:31:42.3739432Z scale_ub: Optional[float], 2025-05-07T20:31:42.3739710Z contiguous: bool, 2025-05-07T20:31:42.3740076Z compiled: bool, 2025-05-07T20:31:42.3740300Z ) -> None: 2025-05-07T20:31:42.3740524Z torch.manual_seed(2025) 2025-05-07T20:31:42.3740771Z 2025-05-07T20:31:42.3741045Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.3741395Z 2025-05-07T20:31:42.3741593Z x_sign = torch.sign(x) 2025-05-07T20:31:42.3741888Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.3742208Z x = x_sign * x_clamp 2025-05-07T20:31:42.3742452Z x0 = x[:, :D] 2025-05-07T20:31:42.3742674Z x1 = x[:, D:] 2025-05-07T20:31:42.3742889Z 2025-05-07T20:31:42.3743082Z if contiguous: 2025-05-07T20:31:42.3743316Z x0 = x0.contiguous() 2025-05-07T20:31:42.3743577Z x1 = x1.contiguous() 2025-05-07T20:31:42.3743820Z 2025-05-07T20:31:42.3744015Z if scale_ub is not None: 2025-05-07T20:31:42.3744296Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.3744637Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.3744950Z ) 2025-05-07T20:31:42.3745150Z else: 2025-05-07T20:31:42.3745372Z scale_ub_tensor = None 2025-05-07T20:31:42.3745635Z 2025-05-07T20:31:42.3745870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.3746195Z op = silu_mul_quant 2025-05-07T20:31:42.3746461Z if compiled: 2025-05-07T20:31:42.3746714Z op = torch.compile(op) 2025-05-07T20:31:42.3747021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.3747299Z 2025-05-07T20:31:42.3747497Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.3747670Z 2025-05-07T20:31:42.3747772Z moe/activation_test.py:117: 2025-05-07T20:31:42.3748082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.3748414Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.3748702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.3749266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.3749831Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.3750489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.3751177Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.3751808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.3752558Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.3753228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.3753761Z kernel = self.compile( 2025-05-07T20:31:42.3754305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.3754964Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.3755419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.3755650Z 2025-05-07T20:31:42.3755866Z self = 2025-05-07T20:31:42.3756944Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.3758310Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55075fc0>} 2025-05-07T20:31:42.3759647Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.3760687Z context = 2025-05-07T20:31:42.3760975Z 2025-05-07T20:31:42.3761151Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.3761670Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.3762144Z module_map=module_map) 2025-05-07T20:31:42.3762514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.3762878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.3763139Z E ^ 2025-05-07T20:31:42.3763616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.3764064Z 2025-05-07T20:31:42.3764484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.3764994Z 2025-05-07T20:31:42.5040344Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.5040996Z self=, 2025-05-07T20:31:42.5041575Z T=1, 2025-05-07T20:31:42.5041839Z D=5120, 2025-05-07T20:31:42.5042109Z scale_ub=None, 2025-05-07T20:31:42.5042343Z contiguous=False, 2025-05-07T20:31:42.5042588Z compiled=False, 2025-05-07T20:31:42.5042816Z ) 2025-05-07T20:31:42.5043148Z self = 2025-05-07T20:31:42.5043651Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:42.5043912Z 2025-05-07T20:31:42.5044001Z @given( 2025-05-07T20:31:42.5044236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.5044557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.5044867Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.5045200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.5045649Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.5045949Z ) 2025-05-07T20:31:42.5046306Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.5046841Z def test_silu_mul_quant( 2025-05-07T20:31:42.5047129Z self, 2025-05-07T20:31:42.5047357Z T: int, 2025-05-07T20:31:42.5047562Z D: int, 2025-05-07T20:31:42.5048006Z scale_ub: Optional[float], 2025-05-07T20:31:42.5048288Z contiguous: bool, 2025-05-07T20:31:42.5048694Z compiled: bool, 2025-05-07T20:31:42.5048921Z ) -> None: 2025-05-07T20:31:42.5049147Z torch.manual_seed(2025) 2025-05-07T20:31:42.5049401Z 2025-05-07T20:31:42.5049673Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.5050029Z 2025-05-07T20:31:42.5050227Z x_sign = torch.sign(x) 2025-05-07T20:31:42.5050528Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.5050834Z x = x_sign * x_clamp 2025-05-07T20:31:42.5051086Z x0 = x[:, :D] 2025-05-07T20:31:42.5051316Z x1 = x[:, D:] 2025-05-07T20:31:42.5051526Z 2025-05-07T20:31:42.5051720Z if contiguous: 2025-05-07T20:31:42.5051961Z x0 = x0.contiguous() 2025-05-07T20:31:42.5052219Z x1 = x1.contiguous() 2025-05-07T20:31:42.5052456Z 2025-05-07T20:31:42.5052664Z if scale_ub is not None: 2025-05-07T20:31:42.5052937Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.5053280Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.5053597Z ) 2025-05-07T20:31:42.5053790Z else: 2025-05-07T20:31:42.5054012Z scale_ub_tensor = None 2025-05-07T20:31:42.5054273Z 2025-05-07T20:31:42.5054517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.5054837Z op = silu_mul_quant 2025-05-07T20:31:42.5055093Z if compiled: 2025-05-07T20:31:42.5055349Z op = torch.compile(op) 2025-05-07T20:31:42.5055648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.5055926Z 2025-05-07T20:31:42.5056124Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.5056289Z 2025-05-07T20:31:42.5056390Z moe/activation_test.py:117: 2025-05-07T20:31:42.5056690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.5057031Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.5057309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.5058010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.5058701Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.5059238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.5059984Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.5060650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.5061179Z kernel = self.compile( 2025-05-07T20:31:42.5061716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.5062369Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.5062773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.5063001Z 2025-05-07T20:31:42.5063223Z self = 2025-05-07T20:31:42.5064294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.5065676Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55077490>} 2025-05-07T20:31:42.5067007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.5068024Z context = 2025-05-07T20:31:42.5068398Z 2025-05-07T20:31:42.5068644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.5069164Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.5069635Z module_map=module_map) 2025-05-07T20:31:42.5070012Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.5070362Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.5070627Z E ^ 2025-05-07T20:31:42.5071099Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.5071541Z 2025-05-07T20:31:42.5071960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.5072479Z 2025-05-07T20:31:42.5072586Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.5073007Z self=, 2025-05-07T20:31:42.5073409Z T=4096, 2025-05-07T20:31:42.5073598Z D=7168, 2025-05-07T20:31:42.5073796Z scale_ub=1200.0, 2025-05-07T20:31:42.5074024Z contiguous=False, 2025-05-07T20:31:42.5074247Z compiled=False, 2025-05-07T20:31:42.5074458Z ) 2025-05-07T20:31:42.5074777Z self = 2025-05-07T20:31:42.5075278Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:42.5075549Z 2025-05-07T20:31:42.5075632Z @given( 2025-05-07T20:31:42.5075860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.5076173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.5076486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.5076811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.5077144Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.5077439Z ) 2025-05-07T20:31:42.5077784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.5078229Z def test_silu_mul_quant( 2025-05-07T20:31:42.5078474Z self, 2025-05-07T20:31:42.5078666Z T: int, 2025-05-07T20:31:42.5078873Z D: int, 2025-05-07T20:31:42.5079099Z scale_ub: Optional[float], 2025-05-07T20:31:42.5079377Z contiguous: bool, 2025-05-07T20:31:42.5079616Z compiled: bool, 2025-05-07T20:31:42.5079847Z ) -> None: 2025-05-07T20:31:42.5080068Z torch.manual_seed(2025) 2025-05-07T20:31:42.5080307Z 2025-05-07T20:31:42.5080587Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.5080936Z 2025-05-07T20:31:42.5081129Z x_sign = torch.sign(x) 2025-05-07T20:31:42.5081423Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.5081736Z x = x_sign * x_clamp 2025-05-07T20:31:42.5081982Z x0 = x[:, :D] 2025-05-07T20:31:42.5082202Z x1 = x[:, D:] 2025-05-07T20:31:42.5082414Z 2025-05-07T20:31:42.5082604Z if contiguous: 2025-05-07T20:31:42.5082844Z x0 = x0.contiguous() 2025-05-07T20:31:42.5083104Z x1 = x1.contiguous() 2025-05-07T20:31:42.5083340Z 2025-05-07T20:31:42.5083542Z if scale_ub is not None: 2025-05-07T20:31:42.5083820Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.5084151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.5084460Z ) 2025-05-07T20:31:42.5084659Z else: 2025-05-07T20:31:42.5084881Z scale_ub_tensor = None 2025-05-07T20:31:42.5085126Z 2025-05-07T20:31:42.5085363Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.5085684Z op = silu_mul_quant 2025-05-07T20:31:42.5085933Z if compiled: 2025-05-07T20:31:42.5086183Z op = torch.compile(op) 2025-05-07T20:31:42.5086578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.5086844Z 2025-05-07T20:31:42.5087046Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.5087330Z 2025-05-07T20:31:42.5087438Z moe/activation_test.py:117: 2025-05-07T20:31:42.5087734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.5088066Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.5088351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.5089034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:42.5089717Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.5090527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.5091205Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.5091860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.5092402Z kernel = self.compile( 2025-05-07T20:31:42.5092951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.5093606Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.5093995Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.5094231Z 2025-05-07T20:31:42.5094439Z self = 2025-05-07T20:31:42.5095506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.5096878Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54500550>} 2025-05-07T20:31:42.5098208Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.5099225Z context = 2025-05-07T20:31:42.5099516Z 2025-05-07T20:31:42.5099687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.5100297Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.5100760Z module_map=module_map) 2025-05-07T20:31:42.5101129Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.5101489Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.5101755Z E ^ 2025-05-07T20:31:42.5102216Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.5102679Z 2025-05-07T20:31:42.5103093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.5103608Z 2025-05-07T20:31:42.5103717Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.5104127Z self=, 2025-05-07T20:31:42.5104521Z T=16384, 2025-05-07T20:31:42.5104716Z D=7168, 2025-05-07T20:31:42.5104913Z scale_ub=None, 2025-05-07T20:31:42.5105128Z contiguous=True, 2025-05-07T20:31:42.5105356Z compiled=True, 2025-05-07T20:31:42.5105561Z ) 2025-05-07T20:31:42.7047299Z self = 2025-05-07T20:31:42.7048202Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:42.7048679Z 2025-05-07T20:31:42.7048809Z @given( 2025-05-07T20:31:42.7049196Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.7050144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.7050794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.7051383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.7051935Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.7052418Z ) 2025-05-07T20:31:42.7053016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.7053777Z def test_silu_mul_quant( 2025-05-07T20:31:42.7054176Z self, 2025-05-07T20:31:42.7054493Z T: int, 2025-05-07T20:31:42.7054807Z D: int, 2025-05-07T20:31:42.7055163Z scale_ub: Optional[float], 2025-05-07T20:31:42.7055618Z contiguous: bool, 2025-05-07T20:31:42.7056013Z compiled: bool, 2025-05-07T20:31:42.7056378Z ) -> None: 2025-05-07T20:31:42.7056729Z torch.manual_seed(2025) 2025-05-07T20:31:42.7057133Z 2025-05-07T20:31:42.7057586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.7058171Z 2025-05-07T20:31:42.7058499Z x_sign = torch.sign(x) 2025-05-07T20:31:42.7058973Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.7059499Z x = x_sign * x_clamp 2025-05-07T20:31:42.7059999Z x0 = x[:, :D] 2025-05-07T20:31:42.7060349Z x1 = x[:, D:] 2025-05-07T20:31:42.7060697Z 2025-05-07T20:31:42.7061000Z if contiguous: 2025-05-07T20:31:42.7061372Z x0 = x0.contiguous() 2025-05-07T20:31:42.7061807Z x1 = x1.contiguous() 2025-05-07T20:31:42.7062211Z 2025-05-07T20:31:42.7062515Z if scale_ub is not None: 2025-05-07T20:31:42.7062973Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.7063529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.7064046Z ) 2025-05-07T20:31:42.7064353Z else: 2025-05-07T20:31:42.7064694Z scale_ub_tensor = None 2025-05-07T20:31:42.7065158Z 2025-05-07T20:31:42.7065565Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.7066106Z op = silu_mul_quant 2025-05-07T20:31:42.7066531Z if compiled: 2025-05-07T20:31:42.7066930Z op = torch.compile(op) 2025-05-07T20:31:42.7067431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.7067896Z 2025-05-07T20:31:42.7068203Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.7079519Z 2025-05-07T20:31:42.7079703Z moe/activation_test.py:117: 2025-05-07T20:31:42.7080226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.7080801Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.7081288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.7082285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.7083288Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.7084486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.7085742Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.7086696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.7087912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.7089096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.7090396Z kernel = self.compile( 2025-05-07T20:31:42.7091366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.7092448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.7093144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.7093782Z 2025-05-07T20:31:42.7094153Z self = 2025-05-07T20:31:42.7096311Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.7098958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54501360>} 2025-05-07T20:31:42.7101499Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.7103352Z context = 2025-05-07T20:31:42.7103869Z 2025-05-07T20:31:42.7104152Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.7105097Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.7105919Z module_map=module_map) 2025-05-07T20:31:42.7106548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.7107159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.7107600Z E ^ 2025-05-07T20:31:42.7108409Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.7109238Z 2025-05-07T20:31:42.7109987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.7110923Z 2025-05-07T20:31:42.7111104Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.7111759Z self=, 2025-05-07T20:31:42.7112285Z T=4096, 2025-05-07T20:31:42.7112540Z D=5120, 2025-05-07T20:31:42.7112798Z scale_ub=None, 2025-05-07T20:31:42.7113091Z contiguous=False, 2025-05-07T20:31:42.7113406Z compiled=True, 2025-05-07T20:31:42.7113698Z ) 2025-05-07T20:31:42.7114105Z self = 2025-05-07T20:31:42.7114772Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:42.7115149Z 2025-05-07T20:31:42.7115267Z @given( 2025-05-07T20:31:42.7115573Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.7115998Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.7116435Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.7116923Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.7117400Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.7117814Z ) 2025-05-07T20:31:42.7118298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.7118944Z def test_silu_mul_quant( 2025-05-07T20:31:42.7119306Z self, 2025-05-07T20:31:42.7119606Z T: int, 2025-05-07T20:31:42.7119894Z D: int, 2025-05-07T20:31:42.7120222Z scale_ub: Optional[float], 2025-05-07T20:31:42.7120611Z contiguous: bool, 2025-05-07T20:31:42.7120965Z compiled: bool, 2025-05-07T20:31:42.7121302Z ) -> None: 2025-05-07T20:31:42.7121601Z torch.manual_seed(2025) 2025-05-07T20:31:42.7121981Z 2025-05-07T20:31:42.7122358Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.7122869Z 2025-05-07T20:31:42.7123172Z x_sign = torch.sign(x) 2025-05-07T20:31:42.7123638Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.7124146Z x = x_sign * x_clamp 2025-05-07T20:31:42.7124538Z x0 = x[:, :D] 2025-05-07T20:31:42.7124870Z x1 = x[:, D:] 2025-05-07T20:31:42.7125198Z 2025-05-07T20:31:42.7125646Z if contiguous: 2025-05-07T20:31:42.7126003Z x0 = x0.contiguous() 2025-05-07T20:31:42.7126417Z x1 = x1.contiguous() 2025-05-07T20:31:42.7126895Z 2025-05-07T20:31:42.7127194Z if scale_ub is not None: 2025-05-07T20:31:42.7127623Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.7128162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.7128658Z ) 2025-05-07T20:31:42.7128955Z else: 2025-05-07T20:31:42.7129272Z scale_ub_tensor = None 2025-05-07T20:31:42.7129670Z 2025-05-07T20:31:42.7130029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.7130525Z op = silu_mul_quant 2025-05-07T20:31:42.7130921Z if compiled: 2025-05-07T20:31:42.7131314Z op = torch.compile(op) 2025-05-07T20:31:42.7131784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.7132220Z 2025-05-07T20:31:42.7132499Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.7132772Z 2025-05-07T20:31:42.7132923Z moe/activation_test.py:117: 2025-05-07T20:31:42.7133390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.7133908Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.7134362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.7135267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.7136198Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.7137335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.7138510Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.7139449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.7140699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.7141845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.7142754Z kernel = self.compile( 2025-05-07T20:31:42.7143690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.7144843Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.7145492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.7145881Z 2025-05-07T20:31:42.7146207Z self = 2025-05-07T20:31:42.7148138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.7150640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54501ea0>} 2025-05-07T20:31:42.7153027Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.7154824Z context = 2025-05-07T20:31:42.7155352Z 2025-05-07T20:31:42.7155664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.7156511Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.7157276Z module_map=module_map) 2025-05-07T20:31:42.7157874Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.7158446Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.7158865Z E ^ 2025-05-07T20:31:42.7159794Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.7160573Z 2025-05-07T20:31:42.7161402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.7162316Z 2025-05-07T20:31:43.0856321Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.0857153Z self=, 2025-05-07T20:31:43.0857849Z T=4096, 2025-05-07T20:31:43.0858160Z D=5120, 2025-05-07T20:31:43.0858471Z scale_ub=1200.0, 2025-05-07T20:31:43.0858804Z contiguous=False, 2025-05-07T20:31:43.0859155Z compiled=False, 2025-05-07T20:31:43.0859469Z ) 2025-05-07T20:31:43.0860055Z self = 2025-05-07T20:31:43.0860909Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.0861387Z 2025-05-07T20:31:43.0861553Z @given( 2025-05-07T20:31:43.0861924Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.0862466Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.0862985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.0863544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.0864095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.0864574Z ) 2025-05-07T20:31:43.0865171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.0865931Z def test_silu_mul_quant( 2025-05-07T20:31:43.0866333Z self, 2025-05-07T20:31:43.0866652Z T: int, 2025-05-07T20:31:43.0866967Z D: int, 2025-05-07T20:31:43.0867323Z scale_ub: Optional[float], 2025-05-07T20:31:43.0867772Z contiguous: bool, 2025-05-07T20:31:43.0868160Z compiled: bool, 2025-05-07T20:31:43.0868529Z ) -> None: 2025-05-07T20:31:43.0868885Z torch.manual_seed(2025) 2025-05-07T20:31:43.0869286Z 2025-05-07T20:31:43.0869732Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.0870320Z 2025-05-07T20:31:43.0870623Z x_sign = torch.sign(x) 2025-05-07T20:31:43.0871104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.0871626Z x = x_sign * x_clamp 2025-05-07T20:31:43.0872025Z x0 = x[:, :D] 2025-05-07T20:31:43.0872370Z x1 = x[:, D:] 2025-05-07T20:31:43.0872710Z 2025-05-07T20:31:43.0873011Z if contiguous: 2025-05-07T20:31:43.0873380Z x0 = x0.contiguous() 2025-05-07T20:31:43.0873807Z x1 = x1.contiguous() 2025-05-07T20:31:43.0874204Z 2025-05-07T20:31:43.0874509Z if scale_ub is not None: 2025-05-07T20:31:43.0874963Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.0875522Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.0876028Z ) 2025-05-07T20:31:43.0876349Z else: 2025-05-07T20:31:43.0876695Z scale_ub_tensor = None 2025-05-07T20:31:43.0877110Z 2025-05-07T20:31:43.0877496Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.0878024Z op = silu_mul_quant 2025-05-07T20:31:43.0878430Z if compiled: 2025-05-07T20:31:43.0878836Z op = torch.compile(op) 2025-05-07T20:31:43.0879326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0879786Z 2025-05-07T20:31:43.0880093Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.0880383Z 2025-05-07T20:31:43.0880543Z moe/activation_test.py:117: 2025-05-07T20:31:43.0881035Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0881565Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.0882034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0883207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:43.0884820Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.0886045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.0887222Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.0888406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.0889345Z kernel = self.compile( 2025-05-07T20:31:43.0890625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.0891802Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.0892490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0892890Z 2025-05-07T20:31:43.0893238Z self = 2025-05-07T20:31:43.0895209Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.0897721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54502680>} 2025-05-07T20:31:43.0900265Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.0902014Z context = 2025-05-07T20:31:43.0902526Z 2025-05-07T20:31:43.0902802Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.0903713Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.0904536Z module_map=module_map) 2025-05-07T20:31:43.0905150Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.0905745Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.0906181Z E ^ 2025-05-07T20:31:43.0906982Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.0909032Z 2025-05-07T20:31:43.0909777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.0910708Z 2025-05-07T20:31:43.0910877Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.0911590Z self=, 2025-05-07T20:31:43.0912269Z T=4096, 2025-05-07T20:31:43.0912572Z D=5120, 2025-05-07T20:31:43.0912885Z scale_ub=1200.0, 2025-05-07T20:31:43.0913250Z contiguous=False, 2025-05-07T20:31:43.0913619Z compiled=True, 2025-05-07T20:31:43.0913957Z ) 2025-05-07T20:31:43.0914490Z self = 2025-05-07T20:31:43.0915365Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.0915882Z 2025-05-07T20:31:43.0916005Z @given( 2025-05-07T20:31:43.0916376Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.0916896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.0917409Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.0917963Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.0918514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.0918999Z ) 2025-05-07T20:31:43.0919596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.0920353Z def test_silu_mul_quant( 2025-05-07T20:31:43.0920755Z self, 2025-05-07T20:31:43.0921291Z T: int, 2025-05-07T20:31:43.0921604Z D: int, 2025-05-07T20:31:43.0921961Z scale_ub: Optional[float], 2025-05-07T20:31:43.0922537Z contiguous: bool, 2025-05-07T20:31:43.0922902Z compiled: bool, 2025-05-07T20:31:43.0923183Z ) -> None: 2025-05-07T20:31:43.0923461Z torch.manual_seed(2025) 2025-05-07T20:31:43.0923783Z 2025-05-07T20:31:43.0924148Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.0924627Z 2025-05-07T20:31:43.0924876Z x_sign = torch.sign(x) 2025-05-07T20:31:43.0925256Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.0925670Z x = x_sign * x_clamp 2025-05-07T20:31:43.0925999Z x0 = x[:, :D] 2025-05-07T20:31:43.0926281Z x1 = x[:, D:] 2025-05-07T20:31:43.0926564Z 2025-05-07T20:31:43.0926810Z if contiguous: 2025-05-07T20:31:43.0927115Z x0 = x0.contiguous() 2025-05-07T20:31:43.0927495Z x1 = x1.contiguous() 2025-05-07T20:31:43.0927858Z 2025-05-07T20:31:43.0928134Z if scale_ub is not None: 2025-05-07T20:31:43.0928534Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.0929010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.0929435Z ) 2025-05-07T20:31:43.0929703Z else: 2025-05-07T20:31:43.0930010Z scale_ub_tensor = None 2025-05-07T20:31:43.0930390Z 2025-05-07T20:31:43.0930717Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.0931178Z op = silu_mul_quant 2025-05-07T20:31:43.0931543Z if compiled: 2025-05-07T20:31:43.0931896Z op = torch.compile(op) 2025-05-07T20:31:43.0932344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0932721Z 2025-05-07T20:31:43.0933018Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.0933241Z 2025-05-07T20:31:43.0933388Z moe/activation_test.py:117: 2025-05-07T20:31:43.0933839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0934377Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.0934835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0935760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.0936703Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.0937817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.0938937Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.0939780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.0941008Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.0942108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.0943005Z kernel = self.compile( 2025-05-07T20:31:43.0943896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.0945036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.0945756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0946149Z 2025-05-07T20:31:43.0946455Z self = 2025-05-07T20:31:43.0948304Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.0950762Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54503ac0>} 2025-05-07T20:31:43.0953360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.0955234Z context = 2025-05-07T20:31:43.0955722Z 2025-05-07T20:31:43.0956003Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.0956910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.0957694Z module_map=module_map) 2025-05-07T20:31:43.0958288Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.0958874Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.0959306Z E ^ 2025-05-07T20:31:43.0960105Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.0960908Z 2025-05-07T20:31:43.0961653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.0962504Z 2025-05-07T20:31:43.2235847Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.2236666Z self=, 2025-05-07T20:31:43.2237371Z T=2048, 2025-05-07T20:31:43.2237688Z D=7168, 2025-05-07T20:31:43.2238004Z scale_ub=1200.0, 2025-05-07T20:31:43.2238372Z contiguous=False, 2025-05-07T20:31:43.2238723Z compiled=False, 2025-05-07T20:31:43.2239050Z ) 2025-05-07T20:31:43.2239514Z self = 2025-05-07T20:31:43.2240355Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.2240837Z 2025-05-07T20:31:43.2240971Z @given( 2025-05-07T20:31:43.2241347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.2241910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.2242429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.2242993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.2243553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.2244041Z ) 2025-05-07T20:31:43.2244641Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.2245403Z def test_silu_mul_quant( 2025-05-07T20:31:43.2245808Z self, 2025-05-07T20:31:43.2246132Z T: int, 2025-05-07T20:31:43.2246444Z D: int, 2025-05-07T20:31:43.2246804Z scale_ub: Optional[float], 2025-05-07T20:31:43.2247263Z contiguous: bool, 2025-05-07T20:31:43.2247652Z compiled: bool, 2025-05-07T20:31:43.2248022Z ) -> None: 2025-05-07T20:31:43.2248377Z torch.manual_seed(2025) 2025-05-07T20:31:43.2248773Z 2025-05-07T20:31:43.2249221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.2249812Z 2025-05-07T20:31:43.2250123Z x_sign = torch.sign(x) 2025-05-07T20:31:43.2250617Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.2251140Z x = x_sign * x_clamp 2025-05-07T20:31:43.2251528Z x0 = x[:, :D] 2025-05-07T20:31:43.2251883Z x1 = x[:, D:] 2025-05-07T20:31:43.2252222Z 2025-05-07T20:31:43.2252517Z if contiguous: 2025-05-07T20:31:43.2252903Z x0 = x0.contiguous() 2025-05-07T20:31:43.2253338Z x1 = x1.contiguous() 2025-05-07T20:31:43.2253739Z 2025-05-07T20:31:43.2254042Z if scale_ub is not None: 2025-05-07T20:31:43.2254500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.2255066Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.2255576Z ) 2025-05-07T20:31:43.2255890Z else: 2025-05-07T20:31:43.2256238Z scale_ub_tensor = None 2025-05-07T20:31:43.2256657Z 2025-05-07T20:31:43.2257459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.2257993Z op = silu_mul_quant 2025-05-07T20:31:43.2258582Z if compiled: 2025-05-07T20:31:43.2259003Z op = torch.compile(op) 2025-05-07T20:31:43.2259498Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2260058Z 2025-05-07T20:31:43.2260373Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.2260653Z 2025-05-07T20:31:43.2260823Z moe/activation_test.py:117: 2025-05-07T20:31:43.2261301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2261834Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.2262299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2263473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:43.2264656Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.2265583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.2266755Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.2267928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.2268867Z kernel = self.compile( 2025-05-07T20:31:43.2269814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.2270982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.2271659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2272062Z 2025-05-07T20:31:43.2272406Z self = 2025-05-07T20:31:43.2274347Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.2276930Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55826200>} 2025-05-07T20:31:43.2279372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.2281108Z context = 2025-05-07T20:31:43.2281620Z 2025-05-07T20:31:43.2281899Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.2282810Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.2283625Z module_map=module_map) 2025-05-07T20:31:43.2284237Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.2284842Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.2285278Z E ^ 2025-05-07T20:31:43.2286081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.2286903Z 2025-05-07T20:31:43.2287643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.2288573Z
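Every example in this excerpt fails the same way, so the root cause is worth stating once: Triton lowers the fp8e4nv (e4m3) dtype only on NVIDIA GPUs with compute capability 8.9 or newer, and on older architectures kernel compilation aborts with exactly the ValueError above, before the kernel ever launches. A minimal guard, as a sketch using only public torch APIs (the helper name and skip message are illustrative, not part of this test suite):

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (e4m3) type needs an NVIDIA GPU
    # with compute capability >= 8.9 (Ada/Hopper); older architectures only
    # expose fp8e4b15/fp8e5 and raise the CompilationError seen in this log.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

# Example use: skip fp8 tests up front instead of failing at Triton compile time.
@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class Fp8ActivationTests(unittest.TestCase):
    pass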
2025-05-07T20:31:43.2288744Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:43.2350629Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:43.5034699Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:31:43.5086257Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:43.6139470Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
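Any one of these combinations can be pinned for quick reproduction with Hypothesis's @example decorator, which replays a fixed input ahead of the randomly drawn ones. A sketch reusing the strategies from the test above (the function body is elided):

from hypothesis import example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
@settings(deadline=None)
def check_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
    ...  # body as in test_silu_mul_quant above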
2025-05-07T20:31:44.0021316Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:31:44.0052779Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:31:44.1972605Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:44.2005477Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:44.3069133Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
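For context on what is being exercised: from the test body, silu_mul_quant evidently computes SiLU(x0) * x1 and quantizes the result to fp8, returning the quantized tensor and a scale. A plain-PyTorch sketch of those semantics, with assumptions flagged (rowwise scaling, the 448.0 e4m3 range, and the helper name are guesses for illustration, not FBGEMM's actual kernel):

import torch

FP8_E4M3_MAX = 448.0  # assumed: max finite value of torch.float8_e4m3fn

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # Sketch: fused SiLU-multiply, then rowwise fp8 quantization.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        # scale_ub is a 1-element float32 tensor, as in the test above.
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_E4M3_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale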
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.3110453Z 2025-05-07T20:31:44.3110977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.3111487Z 2025-05-07T20:31:44.5003525Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5004003Z self=, 2025-05-07T20:31:44.5004415Z T=4096, 2025-05-07T20:31:44.5004598Z D=5120, 2025-05-07T20:31:44.5004794Z scale_ub=1200.0, 2025-05-07T20:31:44.5005022Z contiguous=True, 2025-05-07T20:31:44.5005244Z compiled=True, 2025-05-07T20:31:44.5005452Z ) 2025-05-07T20:31:44.5005774Z self = 2025-05-07T20:31:44.5006267Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.5006541Z 2025-05-07T20:31:44.5006619Z @given( 2025-05-07T20:31:44.5006884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5007190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5007512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5007846Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5008176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5008455Z ) 2025-05-07T20:31:44.5008820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5009261Z def test_silu_mul_quant( 2025-05-07T20:31:44.5009580Z self, 2025-05-07T20:31:44.5009838Z T: int, 2025-05-07T20:31:44.5010110Z D: int, 2025-05-07T20:31:44.5010340Z scale_ub: Optional[float], 2025-05-07T20:31:44.5010633Z contiguous: bool, 2025-05-07T20:31:44.5010934Z compiled: bool, 2025-05-07T20:31:44.5011166Z ) -> None: 2025-05-07T20:31:44.5011395Z torch.manual_seed(2025) 2025-05-07T20:31:44.5011655Z 2025-05-07T20:31:44.5011942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5012503Z 2025-05-07T20:31:44.5012717Z x_sign = torch.sign(x) 2025-05-07T20:31:44.5013020Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.5013348Z x = x_sign * x_clamp 2025-05-07T20:31:44.5013601Z x0 = x[:, :D] 2025-05-07T20:31:44.5013829Z x1 = x[:, D:] 2025-05-07T20:31:44.5014050Z 2025-05-07T20:31:44.5014249Z if contiguous: 2025-05-07T20:31:44.5014487Z x0 = x0.contiguous() 2025-05-07T20:31:44.5014763Z x1 = x1.contiguous() 2025-05-07T20:31:44.5015015Z 2025-05-07T20:31:44.5015215Z if scale_ub is not None: 2025-05-07T20:31:44.5015502Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.5015850Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.5016167Z ) 2025-05-07T20:31:44.5016374Z else: 2025-05-07T20:31:44.5016612Z scale_ub_tensor = None 2025-05-07T20:31:44.5016873Z 2025-05-07T20:31:44.5017123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.5017447Z op = silu_mul_quant 2025-05-07T20:31:44.5017706Z if compiled: 2025-05-07T20:31:44.5017969Z op = torch.compile(op) 2025-05-07T20:31:44.5018282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5018570Z 2025-05-07T20:31:44.5018769Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.5018946Z 2025-05-07T20:31:44.5019053Z moe/activation_test.py:117: 2025-05-07T20:31:44.5019363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5019706Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.5020107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5020693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.5021258Z return fn(*args, **kwargs) 
The next seven Hypothesis examples fail with the identical test source, traceback, and CompilationError; only the drawn parameters differ:

2025-05-07T20:31:44.5037184Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:44.6216071Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:44.9741303Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:44.9782308Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:45.1726720Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:45.1759552Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:45.2823302Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError (fp8e4nv not supported)

The compiled=True runs enter the kernel through torch/_dynamo/eval_frame.py:678 while the compiled=False runs call silu_mul_quant directly, but both reach the same _fbgemm_silu_mul_quant[grid] launch and fail at triton/compiler/compiler.py:100, so torch.compile is not a factor.
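Since @given draws these parameter combinations pseudo-randomly, replaying one failure locally is easier with hypothesis.example, which pins a concrete combination so it always runs first. A standalone sketch under that assumption; the strategies mirror the test's and the body is a stand-in for the real kernel call:

from hypothesis import example, given, settings
import hypothesis.strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=4096, D=5120)  # a failing combination taken from the log above
@settings(max_examples=10, deadline=None)
def check_shapes(T: int, D: int) -> None:
    assert T * D > 0  # stand-in for the real op(x0, x1, scale_ub_tensor) call

check_shapes()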
2025-05-07T20:31:45.3634765Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

(same test source as above, now failing earlier, before the kernel is reached:)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:31:45.3657405Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 32.44 MiB free
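The "Tried to allocate" sizes are not arbitrary: each is exactly one [T, 2*D] bfloat16 tensor (2 bytes per element), i.e. the temporary that torch.randn, torch.sign, torch.abs, or torch.clamp materializes at the failing line. A quick arithmetic check against the sizes reported above and below:

# One [T, 2*D] bf16 tensor costs T * 2*D * 2 bytes; compare with the log.
for T, D in [(16384, 5120), (4096, 7168), (16384, 7168), (2048, 7168)]:
    mib = T * (2 * D) * 2 / 2**20
    print(f"T={T:5d}, D={D}: {mib:6.2f} MiB")
# Prints 320.00, 112.00, 448.00 and 56.00 MiB, matching the allocation
# sizes in this run's OutOfMemoryError messages (56 MiB appears twice).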
2025-05-07T20:31:45.3680299Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 144.44 MiB free
2025-05-07T20:31:45.3702161Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 32.44 MiB free
2025-05-07T20:31:45.3724314Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB with 32.44 MiB free
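Only tens of MiB are free out of 22.07 GiB when these small 56-448 MiB requests fail, so memory from the earlier 16384-row examples is evidently still held by the process. Beyond the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint printed in the error itself, returning cached allocator blocks to the driver between Hypothesis examples is the other obvious lever; a sketch, where the hook placement is an assumption rather than existing test code:

import gc
import os

# The error message's own suggestion; must be set before CUDA is first
# initialized to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def free_cuda_between_examples() -> None:
    # Drop dead Python references, then release cached allocator blocks so
    # the next example starts from a clean slate (e.g. call from tearDown).
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()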
2025-05-07T20:31:45.5008968Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported), traceback identical to the first full example above
2025-05-07T20:31:45.5060938Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> enters the same _fbgemm_silu_mul_quant[grid] compile path; the captured output ends mid-traceback at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5870867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5872070Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5873245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5874203Z kernel = self.compile( 2025-05-07T20:31:45.5875767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5876917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5877621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5878001Z 2025-05-07T20:31:45.5878354Z self = 2025-05-07T20:31:45.5880242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5882664Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa54670>} 2025-05-07T20:31:45.5885072Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5886906Z context = 2025-05-07T20:31:45.5887398Z 2025-05-07T20:31:45.5887673Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5888556Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5889355Z module_map=module_map) 2025-05-07T20:31:45.5890183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5890779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5891203Z E ^ 2025-05-07T20:31:45.5891991Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5892787Z 2025-05-07T20:31:45.5893522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5894433Z 2025-05-07T20:31:45.5894602Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5895305Z self=, 2025-05-07T20:31:45.5895983Z T=128, 2025-05-07T20:31:45.5896279Z D=7168, 2025-05-07T20:31:45.5896579Z scale_ub=None, 2025-05-07T20:31:45.5896930Z contiguous=True, 2025-05-07T20:31:45.5897293Z compiled=False, 2025-05-07T20:31:45.5897617Z ) 2025-05-07T20:31:45.5898141Z self = 2025-05-07T20:31:45.5898978Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.5899440Z 2025-05-07T20:31:45.5899560Z @given( 2025-05-07T20:31:45.5900021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5900547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5901057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5901615Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5902168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5902643Z ) 2025-05-07T20:31:45.5903221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5903981Z def test_silu_mul_quant( 2025-05-07T20:31:45.5904369Z self, 2025-05-07T20:31:45.5904672Z T: int, 2025-05-07T20:31:45.5904987Z D: int, 2025-05-07T20:31:45.5905337Z scale_ub: Optional[float], 2025-05-07T20:31:45.5905777Z contiguous: bool, 2025-05-07T20:31:45.5906178Z compiled: bool, 2025-05-07T20:31:45.5906559Z ) -> None: 2025-05-07T20:31:45.5906896Z torch.manual_seed(2025) 2025-05-07T20:31:45.5907314Z 2025-05-07T20:31:45.5907755Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5908551Z 2025-05-07T20:31:45.5908852Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5909326Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5909985Z x = x_sign * x_clamp 2025-05-07T20:31:45.5910373Z x0 = x[:, :D] 2025-05-07T20:31:45.5910720Z x1 = x[:, D:] 2025-05-07T20:31:45.5911060Z 2025-05-07T20:31:45.5922335Z if contiguous: 2025-05-07T20:31:45.5922745Z x0 = x0.contiguous() 2025-05-07T20:31:45.5923190Z x1 = x1.contiguous() 2025-05-07T20:31:45.5923591Z 2025-05-07T20:31:45.5923911Z if scale_ub is not None: 2025-05-07T20:31:45.5924376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5924951Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5925473Z ) 2025-05-07T20:31:45.5925797Z else: 2025-05-07T20:31:45.5926146Z scale_ub_tensor = None 2025-05-07T20:31:45.5926557Z 2025-05-07T20:31:45.5926941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5927489Z op = silu_mul_quant 2025-05-07T20:31:45.5927903Z if compiled: 2025-05-07T20:31:45.5928336Z op = torch.compile(op) 2025-05-07T20:31:45.5928821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5929291Z 2025-05-07T20:31:45.5929608Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5929889Z 2025-05-07T20:31:45.5930061Z moe/activation_test.py:117: 2025-05-07T20:31:45.5930552Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5931137Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5931609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5932709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5933915Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5934861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5936087Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5937250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5938192Z kernel = self.compile( 2025-05-07T20:31:45.5939139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5940431Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5941105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5941509Z 2025-05-07T20:31:45.5941856Z self = 2025-05-07T20:31:45.5943775Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5946269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa54ee0>} 2025-05-07T20:31:45.5948695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5950519Z context = 2025-05-07T20:31:45.5951033Z 2025-05-07T20:31:45.5951310Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5952218Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5953035Z module_map=module_map) 2025-05-07T20:31:45.5953636Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5954381Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5954812Z E ^ 2025-05-07T20:31:45.5955694Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5956475Z 2025-05-07T20:31:45.5957188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5958070Z 2025-05-07T20:31:45.5958246Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5958936Z self=, 2025-05-07T20:31:45.5959604Z T=2048, 2025-05-07T20:31:45.5959911Z D=7168, 2025-05-07T20:31:45.5960227Z scale_ub=1200.0, 2025-05-07T20:31:45.5960588Z contiguous=True, 2025-05-07T20:31:45.5960961Z compiled=False, 2025-05-07T20:31:45.5961305Z ) 2025-05-07T20:31:45.6908687Z self = 2025-05-07T20:31:45.6909591Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6910064Z 2025-05-07T20:31:45.6910192Z @given( 2025-05-07T20:31:45.6910528Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6911007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6911501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6912054Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6912619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6913104Z ) 2025-05-07T20:31:45.6913681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6914451Z def test_silu_mul_quant( 2025-05-07T20:31:45.6914873Z self, 2025-05-07T20:31:45.6915194Z T: int, 2025-05-07T20:31:45.6915525Z D: int, 2025-05-07T20:31:45.6915884Z scale_ub: Optional[float], 2025-05-07T20:31:45.6916336Z contiguous: bool, 2025-05-07T20:31:45.6916746Z compiled: bool, 2025-05-07T20:31:45.6917099Z ) -> None: 2025-05-07T20:31:45.6917448Z torch.manual_seed(2025) 2025-05-07T20:31:45.6917848Z 2025-05-07T20:31:45.6918278Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6921980Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6925458Z 2025-05-07T20:31:45.6925660Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6926049Z 2025-05-07T20:31:45.6926218Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6926941Z self=, 2025-05-07T20:31:45.6927628Z T=1, 2025-05-07T20:31:45.6927925Z D=5120, 2025-05-07T20:31:45.6928234Z scale_ub=1200.0, 2025-05-07T20:31:45.6928585Z contiguous=True, 2025-05-07T20:31:45.6928936Z compiled=False, 2025-05-07T20:31:45.6929268Z ) 2025-05-07T20:31:45.6929784Z self = 2025-05-07T20:31:45.6930611Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6931065Z 2025-05-07T20:31:45.6931194Z @given( 2025-05-07T20:31:45.6931563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6932096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6932619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6933180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6934170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6934658Z ) 2025-05-07T20:31:45.6935444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6936227Z def test_silu_mul_quant( 2025-05-07T20:31:45.6936639Z self, 2025-05-07T20:31:45.6936956Z T: int, 2025-05-07T20:31:45.6937265Z D: int, 2025-05-07T20:31:45.6937619Z scale_ub: Optional[float], 2025-05-07T20:31:45.6938075Z contiguous: bool, 2025-05-07T20:31:45.6938461Z compiled: bool, 2025-05-07T20:31:45.6938827Z ) -> None: 2025-05-07T20:31:45.6939169Z torch.manual_seed(2025) 2025-05-07T20:31:45.6939553Z 2025-05-07T20:31:45.6940108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6940711Z 2025-05-07T20:31:45.6941013Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6941500Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6942030Z x = x_sign * x_clamp 2025-05-07T20:31:45.6942423Z x0 = x[:, :D] 2025-05-07T20:31:45.6942759Z x1 = x[:, D:] 2025-05-07T20:31:45.6943090Z 2025-05-07T20:31:45.6943387Z if contiguous: 2025-05-07T20:31:45.6943754Z x0 = x0.contiguous() 2025-05-07T20:31:45.6944176Z x1 = x1.contiguous() 2025-05-07T20:31:45.6944574Z 2025-05-07T20:31:45.6944879Z if scale_ub is not None: 2025-05-07T20:31:45.6945330Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6945879Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6946428Z ) 2025-05-07T20:31:45.6946745Z else: 2025-05-07T20:31:45.6947082Z scale_ub_tensor = None 2025-05-07T20:31:45.6947486Z 2025-05-07T20:31:45.6947866Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6948397Z op = silu_mul_quant 2025-05-07T20:31:45.6948805Z if compiled: 2025-05-07T20:31:45.6949214Z op = torch.compile(op) 2025-05-07T20:31:45.6949701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6950146Z 2025-05-07T20:31:45.6950448Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6950727Z 2025-05-07T20:31:45.6950889Z moe/activation_test.py:117: 2025-05-07T20:31:45.6951369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6951917Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6952380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6953576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6954773Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6955700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6956858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6958032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6958948Z kernel = self.compile( 2025-05-07T20:31:45.6959885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6961015Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6961688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6962083Z 2025-05-07T20:31:45.6962409Z self = 2025-05-07T20:31:45.6964303Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6967040Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa55e10>} 2025-05-07T20:31:45.6969426Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6971229Z context = 2025-05-07T20:31:45.6971734Z 2025-05-07T20:31:45.6972008Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6972895Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6973698Z module_map=module_map) 2025-05-07T20:31:45.6974301Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6974891Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6975334Z E ^ 2025-05-07T20:31:45.6976116Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6976978Z 2025-05-07T20:31:45.6977709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6978630Z 2025-05-07T20:31:45.6978801Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6979505Z self=, 2025-05-07T20:31:45.6980248Z T=2048, 2025-05-07T20:31:45.6980558Z D=5120, 2025-05-07T20:31:45.6980867Z scale_ub=None, 2025-05-07T20:31:45.6981208Z contiguous=True, 2025-05-07T20:31:45.6981573Z compiled=False, 2025-05-07T20:31:45.6981916Z ) 2025-05-07T20:31:45.6982441Z self = 2025-05-07T20:31:45.6983288Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6983780Z 2025-05-07T20:31:45.6983904Z @given( 2025-05-07T20:31:45.6984280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6984806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6985325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6985884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6986434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6986916Z ) 2025-05-07T20:31:45.6987506Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6988272Z def test_silu_mul_quant( 2025-05-07T20:31:45.6988675Z self, 2025-05-07T20:31:45.6988991Z T: int, 2025-05-07T20:31:45.6989307Z D: int, 2025-05-07T20:31:45.6989648Z scale_ub: Optional[float], 2025-05-07T20:31:45.6990375Z contiguous: bool, 2025-05-07T20:31:45.6990769Z compiled: bool, 2025-05-07T20:31:45.6991121Z ) -> None: 2025-05-07T20:31:45.6991480Z torch.manual_seed(2025) 2025-05-07T20:31:45.6991880Z 2025-05-07T20:31:45.6992323Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6992910Z 2025-05-07T20:31:45.6993222Z > x_sign = torch.sign(x) 2025-05-07T20:31:45.6996654Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.7000072Z 2025-05-07T20:31:45.7000273Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:45.7000646Z 2025-05-07T20:31:45.7001056Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.7001953Z self=, 2025-05-07T20:31:45.7002659Z T=16384, 2025-05-07T20:31:45.7002965Z D=5120, 2025-05-07T20:31:45.7003280Z scale_ub=None, 2025-05-07T20:31:45.7003632Z contiguous=True, 2025-05-07T20:31:45.7003988Z compiled=False, 2025-05-07T20:31:45.7004323Z ) 2025-05-07T20:31:45.7956074Z self = 2025-05-07T20:31:45.7956999Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.7957460Z 2025-05-07T20:31:45.7957594Z @given( 2025-05-07T20:31:45.7957943Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.7958432Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.7958931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.7959472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.7960038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.7960523Z ) 2025-05-07T20:31:45.7961117Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.7961870Z def test_silu_mul_quant( 2025-05-07T20:31:45.7962272Z self, 2025-05-07T20:31:45.7962584Z T: int, 2025-05-07T20:31:45.7962890Z D: int, 2025-05-07T20:31:45.7963239Z scale_ub: Optional[float], 2025-05-07T20:31:45.7963684Z contiguous: bool, 2025-05-07T20:31:45.7964070Z compiled: bool, 2025-05-07T20:31:45.7964431Z ) -> None: 2025-05-07T20:31:45.7964770Z torch.manual_seed(2025) 2025-05-07T20:31:45.7965156Z 2025-05-07T20:31:45.7965597Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.7969308Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.7972772Z 2025-05-07T20:31:45.7972976Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.7973341Z 2025-05-07T20:31:45.7973515Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.7974224Z self=, 2025-05-07T20:31:45.7974902Z T=4096, 2025-05-07T20:31:45.7975205Z D=5120, 2025-05-07T20:31:45.7975502Z scale_ub=None, 2025-05-07T20:31:45.7975852Z contiguous=True, 2025-05-07T20:31:45.7976208Z compiled=False, 2025-05-07T20:31:45.7976534Z ) 2025-05-07T20:31:45.7977060Z self = 2025-05-07T20:31:45.7977917Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.7978384Z 2025-05-07T20:31:45.7978514Z @given( 2025-05-07T20:31:45.7978886Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.7979415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.7980061Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.7980621Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.7981177Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.7981663Z ) 2025-05-07T20:31:45.7982259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.7983009Z def test_silu_mul_quant( 2025-05-07T20:31:45.7983414Z self, 2025-05-07T20:31:45.7983723Z T: int, 2025-05-07T20:31:45.7984050Z D: int, 2025-05-07T20:31:45.7984404Z scale_ub: Optional[float], 2025-05-07T20:31:45.7985176Z contiguous: bool, 2025-05-07T20:31:45.7985572Z compiled: bool, 2025-05-07T20:31:45.7985934Z ) -> None: 2025-05-07T20:31:45.7986497Z torch.manual_seed(2025) 2025-05-07T20:31:45.7986885Z 2025-05-07T20:31:45.7987346Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.7991259Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.7994623Z 2025-05-07T20:31:45.7994837Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.7995210Z 2025-05-07T20:31:45.7995382Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.7996088Z self=, 2025-05-07T20:31:45.7996761Z T=2048, 2025-05-07T20:31:45.7997059Z D=5120, 2025-05-07T20:31:45.7997366Z scale_ub=None, 2025-05-07T20:31:45.7997709Z contiguous=False, 2025-05-07T20:31:45.7998073Z compiled=False, 2025-05-07T20:31:45.7998407Z ) 2025-05-07T20:31:45.7998934Z self = 2025-05-07T20:31:45.7999775Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.8000239Z 2025-05-07T20:31:45.8000369Z @given( 2025-05-07T20:31:45.8000737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.8001254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.8001764Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.8002331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.8002875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.8003345Z ) 2025-05-07T20:31:45.8003933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.8004685Z def test_silu_mul_quant( 2025-05-07T20:31:45.8005083Z self, 2025-05-07T20:31:45.8005398Z T: int, 2025-05-07T20:31:45.8005700Z D: int, 2025-05-07T20:31:45.8006052Z scale_ub: Optional[float], 2025-05-07T20:31:45.8006495Z contiguous: bool, 2025-05-07T20:31:45.8006880Z compiled: bool, 2025-05-07T20:31:45.8007240Z ) -> None: 2025-05-07T20:31:45.8007583Z torch.manual_seed(2025) 2025-05-07T20:31:45.8007974Z 2025-05-07T20:31:45.8008421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.8012067Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.8015452Z 2025-05-07T20:31:45.8015653Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.8016015Z 2025-05-07T20:31:45.8016189Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8016875Z self=, 2025-05-07T20:31:45.8017562Z T=4096, 2025-05-07T20:31:45.8017865Z D=7168, 2025-05-07T20:31:45.8018162Z scale_ub=None, 2025-05-07T20:31:45.8018512Z contiguous=True, 2025-05-07T20:31:45.8018870Z compiled=True, 2025-05-07T20:31:45.8019410Z ) 2025-05-07T20:31:45.8020061Z self = 2025-05-07T20:31:45.8021070Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.8021546Z 2025-05-07T20:31:45.8021675Z @given( 2025-05-07T20:31:45.8022037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.8022562Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.8023059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.8023604Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.8024161Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.8024637Z ) 2025-05-07T20:31:45.8025221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.8025991Z def test_silu_mul_quant( 2025-05-07T20:31:45.8026443Z self, 2025-05-07T20:31:45.8026752Z T: int, 2025-05-07T20:31:45.8027077Z D: int, 2025-05-07T20:31:45.8027431Z scale_ub: Optional[float], 2025-05-07T20:31:45.8027881Z contiguous: bool, 2025-05-07T20:31:45.8028277Z compiled: bool, 2025-05-07T20:31:45.8028642Z ) -> None: 2025-05-07T20:31:45.8028996Z torch.manual_seed(2025) 2025-05-07T20:31:45.8029384Z 2025-05-07T20:31:45.8029829Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.8033547Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.8036950Z 2025-05-07T20:31:45.8037161Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.8037521Z 2025-05-07T20:31:45.8037706Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8038398Z self=, 2025-05-07T20:31:45.8039087Z T=2048, 2025-05-07T20:31:45.8039389Z D=5120, 2025-05-07T20:31:45.8039685Z scale_ub=1200.0, 2025-05-07T20:31:45.8040054Z contiguous=False, 2025-05-07T20:31:45.8040425Z compiled=False, 2025-05-07T20:31:45.8040752Z ) 2025-05-07T20:31:45.8041280Z self = 2025-05-07T20:31:45.8042112Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.8042590Z 2025-05-07T20:31:45.8042714Z @given( 2025-05-07T20:31:45.8043094Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.8043622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.8044147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.8044701Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.8045265Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.8045750Z ) 2025-05-07T20:31:45.8046335Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.8047099Z def test_silu_mul_quant( 2025-05-07T20:31:45.8047501Z self, 2025-05-07T20:31:45.8047814Z T: int, 2025-05-07T20:31:45.8048135Z D: int, 2025-05-07T20:31:45.8048498Z scale_ub: Optional[float], 2025-05-07T20:31:45.8048942Z contiguous: bool, 2025-05-07T20:31:45.8049334Z compiled: bool, 2025-05-07T20:31:45.8049703Z ) -> None: 2025-05-07T20:31:45.8050045Z torch.manual_seed(2025) 2025-05-07T20:31:45.8050445Z 2025-05-07T20:31:45.8050888Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.8054728Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.8058218Z 2025-05-07T20:31:45.8058426Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.8058790Z 2025-05-07T20:31:45.8058958Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8059666Z self=, 2025-05-07T20:31:45.8060476Z T=4096, 2025-05-07T20:31:45.8060777Z D=7168, 2025-05-07T20:31:45.8061093Z scale_ub=1200.0, 2025-05-07T20:31:45.8061468Z contiguous=True, 2025-05-07T20:31:45.8061830Z compiled=False, 2025-05-07T20:31:45.8062157Z ) 2025-05-07T20:31:45.9328048Z self = 2025-05-07T20:31:45.9328956Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.9329428Z 2025-05-07T20:31:45.9329560Z @given( 2025-05-07T20:31:45.9329931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9330432Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9330932Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9331475Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9331960Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9332400Z ) 2025-05-07T20:31:45.9332947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9333657Z def test_silu_mul_quant( 2025-05-07T20:31:45.9334061Z self, 2025-05-07T20:31:45.9334379Z T: int, 2025-05-07T20:31:45.9334686Z D: int, 2025-05-07T20:31:45.9335044Z scale_ub: Optional[float], 2025-05-07T20:31:45.9335495Z contiguous: bool, 2025-05-07T20:31:45.9335893Z compiled: bool, 2025-05-07T20:31:45.9336262Z ) -> None: 2025-05-07T20:31:45.9336624Z torch.manual_seed(2025) 2025-05-07T20:31:45.9337042Z 2025-05-07T20:31:45.9337487Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9353978Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.9357546Z 2025-05-07T20:31:45.9357763Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.9358148Z 2025-05-07T20:31:45.9358325Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.9359056Z self=, 2025-05-07T20:31:45.9359704Z T=16384, 2025-05-07T20:31:45.9359999Z D=7168, 2025-05-07T20:31:45.9360290Z scale_ub=None, 2025-05-07T20:31:45.9360641Z contiguous=False, 2025-05-07T20:31:45.9361019Z compiled=True, 2025-05-07T20:31:45.9361372Z ) 2025-05-07T20:31:45.9361910Z self = 2025-05-07T20:31:45.9362788Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.9363270Z 2025-05-07T20:31:45.9363412Z @given( 2025-05-07T20:31:45.9363791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9364776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9365303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9366090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9366665Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9367167Z ) 2025-05-07T20:31:45.9367775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9368558Z def test_silu_mul_quant( 2025-05-07T20:31:45.9368973Z self, 2025-05-07T20:31:45.9369305Z T: int, 2025-05-07T20:31:45.9369625Z D: int, 2025-05-07T20:31:45.9369995Z scale_ub: Optional[float], 2025-05-07T20:31:45.9370463Z contiguous: bool, 2025-05-07T20:31:45.9370864Z compiled: bool, 2025-05-07T20:31:45.9371245Z ) -> None: 2025-05-07T20:31:45.9371617Z torch.manual_seed(2025) 2025-05-07T20:31:45.9372020Z 2025-05-07T20:31:45.9372485Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9376254Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.9379712Z 2025-05-07T20:31:45.9379995Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.9380350Z 2025-05-07T20:31:45.9380522Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.9381168Z self=, 2025-05-07T20:31:45.9381848Z T=4096, 2025-05-07T20:31:45.9382172Z D=7168, 2025-05-07T20:31:45.9382481Z scale_ub=None, 2025-05-07T20:31:45.9382846Z contiguous=True, 2025-05-07T20:31:45.9383237Z compiled=False, 2025-05-07T20:31:45.9383584Z ) 2025-05-07T20:31:45.9384135Z self = 2025-05-07T20:31:45.9384988Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.9385466Z 2025-05-07T20:31:45.9385604Z @given( 2025-05-07T20:31:45.9385981Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9386516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9387035Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9387598Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9388172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9388672Z ) 2025-05-07T20:31:45.9389268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9390403Z def test_silu_mul_quant( 2025-05-07T20:31:45.9390803Z self, 2025-05-07T20:31:45.9391114Z T: int, 2025-05-07T20:31:45.9391446Z D: int, 2025-05-07T20:31:45.9391811Z scale_ub: Optional[float], 2025-05-07T20:31:45.9392269Z contiguous: bool, 2025-05-07T20:31:45.9392667Z compiled: bool, 2025-05-07T20:31:45.9393048Z ) -> None: 2025-05-07T20:31:45.9393409Z torch.manual_seed(2025) 2025-05-07T20:31:45.9393812Z 2025-05-07T20:31:45.9394272Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9398182Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.9401732Z 2025-05-07T20:31:45.9401946Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.9402301Z 2025-05-07T20:31:45.9402482Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.9403194Z self=, 2025-05-07T20:31:45.9403914Z T=16384, 2025-05-07T20:31:45.9404241Z D=7168, 2025-05-07T20:31:45.9404557Z scale_ub=None, 2025-05-07T20:31:45.9404925Z contiguous=True, 2025-05-07T20:31:45.9405298Z compiled=False, 2025-05-07T20:31:45.9405642Z ) 2025-05-07T20:31:45.9406186Z self = 2025-05-07T20:31:45.9407094Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.9407590Z 2025-05-07T20:31:45.9407720Z @given( 2025-05-07T20:31:45.9408116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9408646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9409200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9409776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9410350Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9410833Z ) 2025-05-07T20:31:45.9411437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9412208Z def test_silu_mul_quant( 2025-05-07T20:31:45.9412609Z self, 2025-05-07T20:31:45.9412938Z T: int, 2025-05-07T20:31:45.9413265Z D: int, 2025-05-07T20:31:45.9413625Z scale_ub: Optional[float], 2025-05-07T20:31:45.9414089Z contiguous: bool, 2025-05-07T20:31:45.9414497Z compiled: bool, 2025-05-07T20:31:45.9414866Z ) -> None: 2025-05-07T20:31:45.9415232Z torch.manual_seed(2025) 2025-05-07T20:31:45.9415654Z 2025-05-07T20:31:45.9416105Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9419906Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.9423331Z 2025-05-07T20:31:45.9423539Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.9423893Z 2025-05-07T20:31:45.9424067Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.9424779Z self=, 2025-05-07T20:31:45.9425492Z T=16384, 2025-05-07T20:31:45.9425813Z D=7168, 2025-05-07T20:31:45.9426132Z scale_ub=1200.0, 2025-05-07T20:31:45.9426504Z contiguous=True, 2025-05-07T20:31:45.9426877Z compiled=False, 2025-05-07T20:31:45.9427228Z ) 2025-05-07T20:31:45.9427763Z self = 2025-05-07T20:31:45.9428625Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.9429124Z 2025-05-07T20:31:45.9429255Z @given( 2025-05-07T20:31:45.9429643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9430167Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9430700Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9431279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9431841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9432343Z ) 2025-05-07T20:31:45.9432958Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9433882Z def test_silu_mul_quant( 2025-05-07T20:31:45.9434420Z self, 2025-05-07T20:31:45.9434753Z T: int, 2025-05-07T20:31:45.9435073Z D: int, 2025-05-07T20:31:45.9435445Z scale_ub: Optional[float], 2025-05-07T20:31:45.9435908Z contiguous: bool, 2025-05-07T20:31:45.9436322Z compiled: bool, 2025-05-07T20:31:45.9436687Z ) -> None: 2025-05-07T20:31:45.9437049Z torch.manual_seed(2025) 2025-05-07T20:31:45.9437462Z 2025-05-07T20:31:45.9437909Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9441605Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.9444964Z 2025-05-07T20:31:45.9445175Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.9445547Z 2025-05-07T20:31:45.9445733Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.9446446Z self=, 2025-05-07T20:31:45.9447149Z T=128, 2025-05-07T20:31:45.9447468Z D=5120, 2025-05-07T20:31:45.9447786Z scale_ub=1200.0, 2025-05-07T20:31:45.9448152Z contiguous=False, 2025-05-07T20:31:45.9448545Z compiled=False, 2025-05-07T20:31:45.9448889Z ) 2025-05-07T20:31:46.3055193Z self = 2025-05-07T20:31:46.3056088Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.3056586Z 2025-05-07T20:31:46.3056719Z @given( 2025-05-07T20:31:46.3057102Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3057620Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3058129Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3058664Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3059209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3059695Z ) 2025-05-07T20:31:46.3060459Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3061116Z def test_silu_mul_quant( 2025-05-07T20:31:46.3061458Z self, 2025-05-07T20:31:46.3061726Z T: int, 2025-05-07T20:31:46.3061992Z D: int, 2025-05-07T20:31:46.3062294Z scale_ub: Optional[float], 2025-05-07T20:31:46.3062681Z contiguous: bool, 2025-05-07T20:31:46.3063013Z compiled: bool, 2025-05-07T20:31:46.3063334Z ) -> None: 2025-05-07T20:31:46.3063657Z torch.manual_seed(2025) 2025-05-07T20:31:46.3064007Z 2025-05-07T20:31:46.3064418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3064950Z 2025-05-07T20:31:46.3065238Z x_sign = torch.sign(x) 2025-05-07T20:31:46.3065655Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.3066106Z x = x_sign * x_clamp 2025-05-07T20:31:46.3066486Z x0 = x[:, :D] 2025-05-07T20:31:46.3066827Z x1 = x[:, D:] 2025-05-07T20:31:46.3067133Z 2025-05-07T20:31:46.3067421Z if contiguous: 2025-05-07T20:31:46.3067778Z x0 = x0.contiguous() 2025-05-07T20:31:46.3068177Z x1 = x1.contiguous() 2025-05-07T20:31:46.3068565Z 2025-05-07T20:31:46.3068881Z if scale_ub is not None: 2025-05-07T20:31:46.3069354Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.3069932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.3070914Z ) 2025-05-07T20:31:46.3071238Z else: 2025-05-07T20:31:46.3071594Z scale_ub_tensor = None 2025-05-07T20:31:46.3072244Z 2025-05-07T20:31:46.3072646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.3073195Z op = silu_mul_quant 2025-05-07T20:31:46.3073610Z if compiled: 2025-05-07T20:31:46.3074028Z op = torch.compile(op) 2025-05-07T20:31:46.3074532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3074999Z 2025-05-07T20:31:46.3075322Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.3075603Z 2025-05-07T20:31:46.3075779Z moe/activation_test.py:117: 2025-05-07T20:31:46.3076279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3076846Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.3077290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3078457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.3079606Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.3080537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.3081706Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.3082835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.3083774Z kernel = self.compile( 2025-05-07T20:31:46.3084717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.3085847Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.3086523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3086927Z 2025-05-07T20:31:46.3087277Z self = 2025-05-07T20:31:46.3089234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.3092123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54125cf0>} 2025-05-07T20:31:46.3094502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.3096314Z context = 2025-05-07T20:31:46.3096830Z 2025-05-07T20:31:46.3097113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.3098029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.3098852Z module_map=module_map) 2025-05-07T20:31:46.3099468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.3100153Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.3100598Z E ^ 2025-05-07T20:31:46.3101413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.3102207Z 2025-05-07T20:31:46.3102932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.3103807Z 2025-05-07T20:31:46.3103967Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3104614Z self=, 2025-05-07T20:31:46.3105307Z T=2048, 2025-05-07T20:31:46.3105606Z D=7168, 2025-05-07T20:31:46.3106174Z scale_ub=None, 2025-05-07T20:31:46.3106541Z contiguous=False, 2025-05-07T20:31:46.3106885Z compiled=False, 2025-05-07T20:31:46.3107206Z ) 2025-05-07T20:31:46.3107882Z self = 2025-05-07T20:31:46.3108709Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.3109165Z 2025-05-07T20:31:46.3109297Z @given( 2025-05-07T20:31:46.3109651Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3110161Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3110664Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3111210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3111749Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3112227Z ) 2025-05-07T20:31:46.3112810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3113556Z def test_silu_mul_quant( 2025-05-07T20:31:46.3113977Z self, 2025-05-07T20:31:46.3114281Z T: int, 2025-05-07T20:31:46.3114587Z D: int, 2025-05-07T20:31:46.3114950Z scale_ub: Optional[float], 2025-05-07T20:31:46.3115380Z contiguous: bool, 2025-05-07T20:31:46.3115768Z compiled: bool, 2025-05-07T20:31:46.3116133Z ) -> None: 2025-05-07T20:31:46.3116483Z torch.manual_seed(2025) 2025-05-07T20:31:46.3116871Z 2025-05-07T20:31:46.3117325Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3121007Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.3124367Z 2025-05-07T20:31:46.3124568Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.3124931Z 2025-05-07T20:31:46.3125107Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3125797Z self=, 2025-05-07T20:31:46.3126489Z T=128, 2025-05-07T20:31:46.3126791Z D=7168, 2025-05-07T20:31:46.3127100Z scale_ub=1200.0, 2025-05-07T20:31:46.3127456Z contiguous=True, 2025-05-07T20:31:46.3127807Z compiled=True, 2025-05-07T20:31:46.3128133Z ) 2025-05-07T20:31:46.3544228Z self = 2025-05-07T20:31:46.3545117Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.3545561Z 2025-05-07T20:31:46.3545686Z @given( 2025-05-07T20:31:46.3546038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3546494Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3546934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3547431Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3547964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3548420Z ) 2025-05-07T20:31:46.3549019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3549785Z def test_silu_mul_quant( 2025-05-07T20:31:46.3550185Z self, 2025-05-07T20:31:46.3550469Z T: int, 2025-05-07T20:31:46.3550765Z D: int, 2025-05-07T20:31:46.3551093Z scale_ub: Optional[float], 2025-05-07T20:31:46.3551509Z contiguous: bool, 2025-05-07T20:31:46.3551866Z compiled: bool, 2025-05-07T20:31:46.3552205Z ) -> None: 2025-05-07T20:31:46.3552535Z torch.manual_seed(2025) 2025-05-07T20:31:46.3552935Z 2025-05-07T20:31:46.3553661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3554147Z 2025-05-07T20:31:46.3554562Z x_sign = torch.sign(x) 2025-05-07T20:31:46.3554987Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.3555428Z x = x_sign * x_clamp 2025-05-07T20:31:46.3555771Z x0 = x[:, :D] 2025-05-07T20:31:46.3556084Z x1 = x[:, D:] 2025-05-07T20:31:46.3556378Z 2025-05-07T20:31:46.3556653Z if contiguous: 2025-05-07T20:31:46.3557001Z x0 = x0.contiguous() 2025-05-07T20:31:46.3557387Z x1 = x1.contiguous() 2025-05-07T20:31:46.3557744Z 2025-05-07T20:31:46.3558021Z if scale_ub is not None: 2025-05-07T20:31:46.3558415Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.3558905Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.3559345Z ) 2025-05-07T20:31:46.3559621Z else: 2025-05-07T20:31:46.3559915Z scale_ub_tensor = None 2025-05-07T20:31:46.3560285Z 2025-05-07T20:31:46.3560612Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.3561062Z op = silu_mul_quant 2025-05-07T20:31:46.3561423Z if compiled: 2025-05-07T20:31:46.3561774Z op = torch.compile(op) 2025-05-07T20:31:46.3562188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3562598Z 2025-05-07T20:31:46.3562865Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.3563099Z 2025-05-07T20:31:46.3563236Z moe/activation_test.py:117: 2025-05-07T20:31:46.3563659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3564138Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.3564542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3565359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.3566195Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.3567196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.3568237Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.3569036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.3570062Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.3571065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.3571857Z kernel = self.compile( 2025-05-07T20:31:46.3572665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.3573659Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.3574232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3574588Z 2025-05-07T20:31:46.3574881Z self = 2025-05-07T20:31:46.3576587Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.3578757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c541270a0>} 2025-05-07T20:31:46.3581077Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.3582747Z context = 2025-05-07T20:31:46.3583206Z 2025-05-07T20:31:46.3583459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.3584537Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.3585281Z module_map=module_map) 2025-05-07T20:31:46.3585823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.3586359Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.3586752Z E ^ 2025-05-07T20:31:46.3587485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.3588242Z 2025-05-07T20:31:46.3588923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.3589776Z 2025-05-07T20:31:46.3590281Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3590924Z self=, 2025-05-07T20:31:46.3591541Z T=128, 2025-05-07T20:31:46.3591824Z D=7168, 2025-05-07T20:31:46.3592105Z scale_ub=1200.0, 2025-05-07T20:31:46.3592421Z contiguous=True, 2025-05-07T20:31:46.3592755Z compiled=False, 2025-05-07T20:31:46.3593057Z ) 2025-05-07T20:31:46.3593527Z self = 2025-05-07T20:31:46.3594293Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.3594721Z 2025-05-07T20:31:46.3594834Z @given( 2025-05-07T20:31:46.3595165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3595625Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3596091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3596601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3597099Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3597532Z ) 2025-05-07T20:31:46.3598068Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3598766Z def test_silu_mul_quant( 2025-05-07T20:31:46.3599122Z self, 2025-05-07T20:31:46.3599409Z T: int, 2025-05-07T20:31:46.3599687Z D: int, 2025-05-07T20:31:46.3599994Z scale_ub: Optional[float], 2025-05-07T20:31:46.3600382Z contiguous: bool, 2025-05-07T20:31:46.3600721Z compiled: bool, 2025-05-07T20:31:46.3601031Z ) -> None: 2025-05-07T20:31:46.3601338Z torch.manual_seed(2025) 2025-05-07T20:31:46.3601692Z 2025-05-07T20:31:46.3602081Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3602596Z 2025-05-07T20:31:46.3602880Z x_sign = torch.sign(x) 2025-05-07T20:31:46.3603316Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.3606401Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
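The allocator hint repeated in these messages, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only takes effect if it is set before the process makes its first CUDA allocation, so in CI it is usually exported in the job environment rather than inside the test. A minimal sketch, assuming the harness has not already set the variable:

    import os

    # Must be set before the first CUDA allocation in this process;
    # exporting it in the workflow environment is the more robust option.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so CUDA init picks it up

    x = torch.randn([2048, 2 * 7168], device="cuda", dtype=torch.bfloat16)

This only mitigates fragmentation; it does not reclaim the ~21.7 GiB the process already holds in these traces.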
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.3609205Z 2025-05-07T20:31:46.3609380Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:46.3609696Z 2025-05-07T20:31:46.3609846Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3610446Z self=, 2025-05-07T20:31:46.3611023Z T=128, 2025-05-07T20:31:46.3611293Z D=5120, 2025-05-07T20:31:46.3611560Z scale_ub=1200.0, 2025-05-07T20:31:46.3611863Z contiguous=True, 2025-05-07T20:31:46.3612174Z compiled=True, 2025-05-07T20:31:46.3612458Z ) 2025-05-07T20:31:46.3612917Z self = 2025-05-07T20:31:46.3613948Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.3625549Z 2025-05-07T20:31:46.3625679Z @given( 2025-05-07T20:31:46.3626026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3626475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3626927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3627421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3627905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3628315Z ) 2025-05-07T20:31:46.3628826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3629493Z def test_silu_mul_quant( 2025-05-07T20:31:46.3629836Z self, 2025-05-07T20:31:46.3630115Z T: int, 2025-05-07T20:31:46.3630397Z D: int, 2025-05-07T20:31:46.3630700Z scale_ub: Optional[float], 2025-05-07T20:31:46.3631109Z contiguous: bool, 2025-05-07T20:31:46.3631455Z compiled: bool, 2025-05-07T20:31:46.3631768Z ) -> None: 2025-05-07T20:31:46.3632087Z torch.manual_seed(2025) 2025-05-07T20:31:46.3632442Z 2025-05-07T20:31:46.3632822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3633325Z 2025-05-07T20:31:46.3633607Z > x_sign = torch.sign(x) 2025-05-07T20:31:46.3636573Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
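By this point the OOMs fire on 20 MiB requests while PyTorch already holds ~21.8 GiB, which points at allocations surviving across Hypothesis examples rather than at any one trial. A hedged mitigation sketch, assuming the test class can tolerate the extra synchronization, is to release cached blocks between test methods:

    import gc
    import unittest

    import torch

    class ActivationTests(unittest.TestCase):
        def tearDown(self) -> None:
            # Drop dangling Python references first so the caching allocator
            # can free their blocks, then return cached segments to the driver.
            gc.collect()
            torch.cuda.empty_cache()

Note that tearDown runs once per test method, not once per Hypothesis example; freeing inside the test body after each example would be the stronger, more intrusive variant.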
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.3639421Z 2025-05-07T20:31:46.3639601Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:46.3639913Z 2025-05-07T20:31:46.3640072Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3640686Z self=, 2025-05-07T20:31:46.3641282Z T=128, 2025-05-07T20:31:46.3641545Z D=7168, 2025-05-07T20:31:46.3641817Z scale_ub=None, 2025-05-07T20:31:46.3642124Z contiguous=True, 2025-05-07T20:31:46.3642435Z compiled=True, 2025-05-07T20:31:46.3642728Z ) 2025-05-07T20:31:46.6709494Z self = 2025-05-07T20:31:46.6710048Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.6710315Z 2025-05-07T20:31:46.6710400Z @given( 2025-05-07T20:31:46.6710647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6710969Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6711308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6711632Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6711986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6712286Z ) 2025-05-07T20:31:46.6712636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6713081Z def test_silu_mul_quant( 2025-05-07T20:31:46.6713330Z self, 2025-05-07T20:31:46.6713523Z T: int, 2025-05-07T20:31:46.6713728Z D: int, 2025-05-07T20:31:46.6713960Z scale_ub: Optional[float], 2025-05-07T20:31:46.6714229Z contiguous: bool, 2025-05-07T20:31:46.6714476Z compiled: bool, 2025-05-07T20:31:46.6714709Z ) -> None: 2025-05-07T20:31:46.6714925Z torch.manual_seed(2025) 2025-05-07T20:31:46.6715176Z 2025-05-07T20:31:46.6715456Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6717762Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
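For reference while reading the trial listings below: ref_fn in this test computes SiLU-and-multiply in fp32, y = x0 * sigmoid(x0) * x1, and then calls triton_quantize_fp8_row, whose rowwise contract the test relies on when it dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A rough eager-mode sketch of that contract, not FBGEMM's kernel, with the e4m3 max of 448 and the clamping details as assumptions:

    from typing import Optional, Tuple

    import torch

    def rowwise_quantize_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Illustrative per-row absmax scaling into float8_e4m3fn.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = fp8_max / row_max
        y_fp8 = (y * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        # Return the reciprocal scale so y ~= y_fp8.float() * y_scale[:, None].
        return y_fp8, row_max.squeeze(1) / fp8_max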
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.6719852Z 2025-05-07T20:31:46.6719974Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.6720189Z 2025-05-07T20:31:46.6778441Z FAILED 2025-05-07T20:31:46.6778581Z 2025-05-07T20:31:46.6778736Z =================================== FAILURES =================================== 2025-05-07T20:31:46.6779279Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:46.6780129Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:46.6781022Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:31:46.6781792Z | yield 2025-05-07T20:31:46.6782394Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:31:46.6783114Z | self._callTestMethod(testMethod) 2025-05-07T20:31:46.6783898Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:31:46.6784655Z | method() 2025-05-07T20:31:46.6785547Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:46.6786553Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6787443Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:46.6788319Z | raise the_error_hypothesis_found 2025-05-07T20:31:46.6788996Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:46.6789669Z +-+---------------- 1 ---------------- 2025-05-07T20:31:46.6790312Z | Traceback (most recent call last): 2025-05-07T20:31:46.6791313Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:46.6792386Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6795276Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
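Hypothesis reports the run's distinct falsifying examples as a single exceptiongroup.ExceptionGroup, shown above with its four sub-exceptions; on Python 3.10 that comes from the exceptiongroup backport rather than the 3.11 builtin. A sketch of programmatically separating the two failure modes in this log, assuming the backport's split() API:

    import torch
    from exceptiongroup import BaseExceptionGroup  # backport for Python < 3.11

    def partition_failures(eg: BaseExceptionGroup):
        # Split into CUDA OOMs and the rest (here, Triton CompilationErrors).
        oom_group, other_group = eg.split(torch.OutOfMemoryError)
        return oom_group, other_group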
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.6798017Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:46.6798469Z | self=, 2025-05-07T20:31:46.6798881Z | T=128, 2025-05-07T20:31:46.6799083Z | D=7168, 2025-05-07T20:31:46.6799303Z | scale_ub=1200.0, 2025-05-07T20:31:46.6799553Z | contiguous=True, 2025-05-07T20:31:46.6799798Z | compiled=False, 2025-05-07T20:31:46.6800032Z | ) 2025-05-07T20:31:46.6800220Z | 2025-05-07T20:31:46.6800743Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:46.6801351Z +---------------- 2 ---------------- 2025-05-07T20:31:46.6801650Z | Traceback (most recent call last): 2025-05-07T20:31:46.6802701Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:46.6803485Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6805522Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.6807523Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:46.6807975Z | self=, 2025-05-07T20:31:46.6808377Z | T=128, 2025-05-07T20:31:46.6808594Z | D=7168, 2025-05-07T20:31:46.6808815Z | scale_ub=None, 2025-05-07T20:31:46.6809071Z | contiguous=True, 2025-05-07T20:31:46.6809317Z | compiled=True, 2025-05-07T20:31:46.6809551Z | ) 2025-05-07T20:31:46.6809741Z | 2025-05-07T20:31:46.6810267Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:46.6810887Z +---------------- 3 ---------------- 2025-05-07T20:31:46.6811186Z | Traceback (most recent call last): 2025-05-07T20:31:46.6811899Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:46.6812678Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6814730Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
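Each falsifying example above comes with a @reproduce_failure decorator that replays exactly that example; Hypothesis intends it as a temporary addition while debugging, and the payload is only valid against the same Hypothesis version (6.131.14 here) and an unchanged strategy stack. A sketch using the first payload from this log; the ellipsis stands for the existing test body:

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # unchanged test body from activation_test.py above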
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.6817562Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:46.6818175Z | self=, 2025-05-07T20:31:46.6818734Z | T=128, 2025-05-07T20:31:46.6819017Z | D=5120, 2025-05-07T20:31:46.6819302Z | scale_ub=1200.0, 2025-05-07T20:31:46.6819629Z | contiguous=True, 2025-05-07T20:31:46.6820143Z | compiled=True, 2025-05-07T20:31:46.6820466Z | ) 2025-05-07T20:31:46.6820702Z | 2025-05-07T20:31:46.6821432Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:46.6822281Z +---------------- 4 ---------------- 2025-05-07T20:31:46.6822682Z | Traceback (most recent call last): 2025-05-07T20:31:46.6823665Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:46.6824654Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.6825576Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:46.6826538Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.6827685Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:46.6828997Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.6829846Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:46.6830867Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6831899Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:46.6832981Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.6834076Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:46.6834888Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.6835678Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:46.6836365Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.6837013Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:46.6837574Z | fn() 2025-05-07T20:31:46.6838137Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:46.6838763Z | self.fn.run( 2025-05-07T20:31:46.6839292Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:46.6839864Z | kernel = self.compile( 2025-05-07T20:31:46.6840465Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:46.6841177Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.6841885Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:46.6842670Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.6843195Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.6843555Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.6843811Z | ^ 2025-05-07T20:31:46.6844271Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.6844836Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:46.6845233Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:46.6845751Z | self=, 2025-05-07T20:31:46.6846192Z | T=1, # or any other generated value 2025-05-07T20:31:46.6846506Z | D=5120, # or any other generated value 2025-05-07T20:31:46.6846841Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:46.6847204Z | contiguous=True, # or any other generated value 2025-05-07T20:31:46.6847570Z | compiled=True, # or any other generated value 2025-05-07T20:31:46.6847869Z | ) 2025-05-07T20:31:46.6848044Z | 2025-05-07T20:31:46.6848682Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:46.6849545Z +------------------------------------ 2025-05-07T20:31:46.6850057Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:46.6850602Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.6851354Z self=, 2025-05-07T20:31:46.6851907Z T=1, 2025-05-07T20:31:46.6852294Z D=5120, 2025-05-07T20:31:46.6852567Z scale_ub=None, 2025-05-07T20:31:46.6852860Z contiguous=True, 2025-05-07T20:31:46.6853172Z compiled=True, 2025-05-07T20:31:46.6853471Z ) 2025-05-07T20:31:46.6853905Z self = 2025-05-07T20:31:46.6854574Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.6854944Z 2025-05-07T20:31:46.6855057Z @given( 2025-05-07T20:31:46.6855387Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6855824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6856262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6856731Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6857185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6857603Z ) 2025-05-07T20:31:46.6858098Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6858712Z def test_silu_mul_quant( 2025-05-07T20:31:46.6859023Z self, 2025-05-07T20:31:46.6859285Z T: int, 2025-05-07T20:31:46.6859534Z D: int, 2025-05-07T20:31:46.6859826Z scale_ub: Optional[float], 2025-05-07T20:31:46.6860449Z contiguous: bool, 2025-05-07T20:31:46.6860786Z compiled: bool, 2025-05-07T20:31:46.6861101Z ) -> None: 2025-05-07T20:31:46.6861403Z torch.manual_seed(2025) 2025-05-07T20:31:46.6861747Z 2025-05-07T20:31:46.6862125Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6862610Z 2025-05-07T20:31:46.6862894Z x_sign = torch.sign(x) 2025-05-07T20:31:46.6863301Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.6863743Z x = x_sign * x_clamp 2025-05-07T20:31:46.6864083Z x0 = x[:, :D] 2025-05-07T20:31:46.6864366Z x1 = x[:, D:] 2025-05-07T20:31:46.6864640Z 2025-05-07T20:31:46.6864897Z if contiguous: 2025-05-07T20:31:46.6865209Z x0 = x0.contiguous() 
2025-05-07T20:31:46.6865576Z x1 = x1.contiguous() 2025-05-07T20:31:46.6865902Z 2025-05-07T20:31:46.6866135Z if scale_ub is not None: 2025-05-07T20:31:46.6866473Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.6866930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.6867371Z ) 2025-05-07T20:31:46.6867639Z else: 2025-05-07T20:31:46.6867941Z scale_ub_tensor = None 2025-05-07T20:31:46.6868297Z 2025-05-07T20:31:46.6868615Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6869063Z op = silu_mul_quant 2025-05-07T20:31:46.6869423Z if compiled: 2025-05-07T20:31:46.6869770Z op = torch.compile(op) 2025-05-07T20:31:46.6870182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6870573Z 2025-05-07T20:31:46.6870844Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.6871258Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.6871671Z 2025-05-07T20:31:46.6872007Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6872477Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.6872894Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.6873332Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.6873836Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.6874263Z 2025-05-07T20:31:46.6874549Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.6874824Z 2025-05-07T20:31:46.6874967Z moe/activation_test.py:126: 2025-05-07T20:31:46.6875390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6875856Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.6876462Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.6877647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.6878691Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.6879445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.6880402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6881401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.6882383Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.6883442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.6884494Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.6886769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.6887684Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.6888525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.6889239Z fn() 2025-05-07T20:31:46.6890218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.6891047Z self.fn.run( 2025-05-07T20:31:46.6891701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.6892435Z kernel = self.compile( 2025-05-07T20:31:46.6893200Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.6894122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.6894671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6894996Z 2025-05-07T20:31:46.6895286Z self = 2025-05-07T20:31:46.6896782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.6898578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7e568550>} 2025-05-07T20:31:46.6900361Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.6901706Z context = 2025-05-07T20:31:46.6902071Z 2025-05-07T20:31:46.6902269Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.6902944Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.6903528Z module_map=module_map) 2025-05-07T20:31:46.6904004Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.6904458Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.6904802Z E ^ 2025-05-07T20:31:46.6905409Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.6906022Z 2025-05-07T20:31:46.6906581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.6907505Z 2025-05-07T20:31:46.6907639Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.6908261Z self=, 2025-05-07T20:31:46.6908780Z T=2048, 2025-05-07T20:31:46.6909013Z D=5120, 2025-05-07T20:31:46.6909249Z scale_ub=1200.0, 2025-05-07T20:31:46.6909513Z contiguous=True, 2025-05-07T20:31:46.6909792Z compiled=False, 2025-05-07T20:31:46.6910079Z ) 2025-05-07T20:31:46.6910461Z self = 2025-05-07T20:31:46.6911060Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.6911390Z 2025-05-07T20:31:46.6911491Z @given( 2025-05-07T20:31:46.6911763Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6912138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6912506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6912905Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6913309Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6913654Z ) 2025-05-07T20:31:46.6914086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6914689Z def test_silu_mul_quant( 2025-05-07T20:31:46.6915032Z self, 2025-05-07T20:31:46.6915306Z T: int, 2025-05-07T20:31:46.6915540Z D: int, 2025-05-07T20:31:46.6915804Z scale_ub: Optional[float], 2025-05-07T20:31:46.6916134Z contiguous: bool, 2025-05-07T20:31:46.6916451Z compiled: bool, 2025-05-07T20:31:46.6916748Z ) -> None: 2025-05-07T20:31:46.6917012Z torch.manual_seed(2025) 2025-05-07T20:31:46.6917302Z 2025-05-07T20:31:46.6917632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6918050Z 2025-05-07T20:31:46.6918276Z x_sign = torch.sign(x) 2025-05-07T20:31:46.6918629Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.6919014Z x = x_sign * x_clamp 2025-05-07T20:31:46.6919325Z x0 = x[:, :D] 
2025-05-07T20:31:46.6919586Z x1 = x[:, D:] 2025-05-07T20:31:46.6919847Z 2025-05-07T20:31:46.6920074Z if contiguous: 2025-05-07T20:31:46.6920350Z x0 = x0.contiguous() 2025-05-07T20:31:46.6920664Z x1 = x1.contiguous() 2025-05-07T20:31:46.6920957Z 2025-05-07T20:31:46.6921216Z if scale_ub is not None: 2025-05-07T20:31:46.6921599Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.6922067Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.6922502Z ) 2025-05-07T20:31:46.6943567Z else: 2025-05-07T20:31:46.6943940Z scale_ub_tensor = None 2025-05-07T20:31:46.6944285Z 2025-05-07T20:31:46.6944588Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6945027Z op = silu_mul_quant 2025-05-07T20:31:46.6945392Z if compiled: 2025-05-07T20:31:46.6945775Z op = torch.compile(op) 2025-05-07T20:31:46.6946208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6946615Z 2025-05-07T20:31:46.6946907Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.6947147Z 2025-05-07T20:31:46.6947304Z moe/activation_test.py:117: 2025-05-07T20:31:46.6947722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6948189Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.6948593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6949526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.6950455Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.6951187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.6952101Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6953274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.6954022Z kernel = self.compile( 2025-05-07T20:31:46.6954761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.6955654Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.6956196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6956522Z 2025-05-07T20:31:46.6956806Z self = 2025-05-07T20:31:46.6958314Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.6960245Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7f67b250>} 2025-05-07T20:31:46.6962113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.6963549Z context = 2025-05-07T20:31:46.6963963Z 2025-05-07T20:31:46.6964195Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.6964931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.6965583Z module_map=module_map) 2025-05-07T20:31:46.6966099Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.6966583Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.6966962Z E ^ 2025-05-07T20:31:46.6967616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.6968261Z 2025-05-07T20:31:46.6968869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.6969616Z 2025-05-07T20:31:46.6969767Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.6970316Z self=, 2025-05-07T20:31:46.6970888Z T=2048, 2025-05-07T20:31:46.6971158Z D=5120, 2025-05-07T20:31:46.6971428Z scale_ub=1200.0, 2025-05-07T20:31:46.6971747Z contiguous=True, 2025-05-07T20:31:46.6972058Z compiled=True, 2025-05-07T20:31:46.6972350Z ) 2025-05-07T20:31:46.6972799Z self = 2025-05-07T20:31:46.6973486Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.6973861Z 2025-05-07T20:31:46.6973976Z @given( 2025-05-07T20:31:46.6974288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6974736Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6975167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6975624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6976093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6976499Z ) 2025-05-07T20:31:46.6976984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6977607Z def test_silu_mul_quant( 2025-05-07T20:31:46.6977952Z self, 2025-05-07T20:31:46.6978226Z T: int, 2025-05-07T20:31:46.6978494Z D: int, 2025-05-07T20:31:46.6978805Z scale_ub: Optional[float], 2025-05-07T20:31:46.6979191Z contiguous: bool, 2025-05-07T20:31:46.6979526Z compiled: bool, 2025-05-07T20:31:46.6980012Z ) -> None: 2025-05-07T20:31:46.6980435Z torch.manual_seed(2025) 2025-05-07T20:31:46.6980786Z 2025-05-07T20:31:46.6981259Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6981734Z 2025-05-07T20:31:46.6982000Z x_sign = torch.sign(x) 2025-05-07T20:31:46.6982395Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.6982822Z x = x_sign * x_clamp 2025-05-07T20:31:46.6983144Z x0 = x[:, :D] 2025-05-07T20:31:46.6983461Z x1 = x[:, D:] 2025-05-07T20:31:46.6983752Z 2025-05-07T20:31:46.6984010Z if contiguous: 2025-05-07T20:31:46.6984339Z x0 = x0.contiguous() 2025-05-07T20:31:46.6984691Z x1 = x1.contiguous() 2025-05-07T20:31:46.6985020Z 2025-05-07T20:31:46.6985274Z if scale_ub is not None: 2025-05-07T20:31:46.6985651Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.6986071Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.6986454Z ) 2025-05-07T20:31:46.6986685Z else: 2025-05-07T20:31:46.6986959Z scale_ub_tensor = None 2025-05-07T20:31:46.6987268Z 2025-05-07T20:31:46.6987555Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6987959Z op = silu_mul_quant 2025-05-07T20:31:46.6988263Z if compiled: 2025-05-07T20:31:46.6988579Z op = torch.compile(op) 2025-05-07T20:31:46.6988951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6989292Z 2025-05-07T20:31:46.6989552Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.6990210Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.6990606Z 2025-05-07T20:31:46.6990896Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6991295Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.6991650Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.6992053Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.6992542Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.6992967Z 2025-05-07T20:31:46.6993241Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.6993511Z 2025-05-07T20:31:46.6993644Z moe/activation_test.py:126: 2025-05-07T20:31:46.6994042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6994507Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.6994953Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.6995984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.6996977Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.6997725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.6998660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6999607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7000599Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7001612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7002635Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7003619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7004514Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7005348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7006057Z fn() 2025-05-07T20:31:46.7006734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7008582Z self.fn.run( 2025-05-07T20:31:46.7009391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7010157Z kernel = self.compile( 2025-05-07T20:31:46.7010916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7011818Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7012365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7012690Z 2025-05-07T20:31:46.7012979Z self = 2025-05-07T20:31:46.7014456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7016371Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1c7f67a950>} 2025-05-07T20:31:46.7018204Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7019555Z context = 2025-05-07T20:31:46.7020048Z 2025-05-07T20:31:46.7020274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7020969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7021587Z module_map=module_map) 2025-05-07T20:31:46.7022054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7022521Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7022874Z E ^ 2025-05-07T20:31:46.7023528Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7024191Z 2025-05-07T20:31:46.7024789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7025536Z 2025-05-07T20:31:46.7025689Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7026240Z self=, 2025-05-07T20:31:46.7026755Z T=16384, 2025-05-07T20:31:46.7027012Z D=7168, 2025-05-07T20:31:46.7027274Z scale_ub=1200.0, 2025-05-07T20:31:46.7027563Z contiguous=False, 2025-05-07T20:31:46.7027857Z compiled=False, 2025-05-07T20:31:46.7028130Z ) 2025-05-07T20:31:46.7028540Z self = 2025-05-07T20:31:46.7029194Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.7029565Z 2025-05-07T20:31:46.7029672Z @given( 2025-05-07T20:31:46.7029972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7030370Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7030767Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7031198Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7031620Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7031996Z ) 2025-05-07T20:31:46.7032452Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7033024Z def test_silu_mul_quant( 2025-05-07T20:31:46.7033341Z self, 2025-05-07T20:31:46.7033593Z T: int, 2025-05-07T20:31:46.7033849Z D: int, 2025-05-07T20:31:46.7034138Z scale_ub: Optional[float], 2025-05-07T20:31:46.7034496Z contiguous: bool, 2025-05-07T20:31:46.7034910Z compiled: bool, 2025-05-07T20:31:46.7035204Z ) -> None: 2025-05-07T20:31:46.7035603Z torch.manual_seed(2025) 2025-05-07T20:31:46.7035925Z 2025-05-07T20:31:46.7036270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7036714Z 2025-05-07T20:31:46.7036969Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7037336Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7037745Z x = x_sign * x_clamp 2025-05-07T20:31:46.7038060Z x0 = x[:, :D] 2025-05-07T20:31:46.7038345Z x1 = x[:, D:] 2025-05-07T20:31:46.7038623Z 2025-05-07T20:31:46.7038875Z if contiguous: 2025-05-07T20:31:46.7039175Z x0 = x0.contiguous() 2025-05-07T20:31:46.7039522Z x1 = x1.contiguous() 2025-05-07T20:31:46.7039851Z 2025-05-07T20:31:46.7040110Z if scale_ub is not None: 2025-05-07T20:31:46.7040486Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7040966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7041389Z ) 2025-05-07T20:31:46.7041664Z else: 2025-05-07T20:31:46.7041955Z scale_ub_tensor = None 2025-05-07T20:31:46.7042304Z 2025-05-07T20:31:46.7042622Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7043060Z op = silu_mul_quant 2025-05-07T20:31:46.7043414Z if compiled: 
2025-05-07T20:31:46.7043755Z op = torch.compile(op) 2025-05-07T20:31:46.7044173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7044519Z 2025-05-07T20:31:46.7044762Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7044989Z 2025-05-07T20:31:46.7045122Z moe/activation_test.py:117: 2025-05-07T20:31:46.7045482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7045906Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7046306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7047300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7048269Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7048959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7049926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7050807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7051453Z kernel = self.compile( 2025-05-07T20:31:46.7052113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7052916Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7053396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7053679Z 2025-05-07T20:31:46.7053924Z self = 2025-05-07T20:31:46.7055250Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7057013Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7e5be4d0>} 2025-05-07T20:31:46.7058859Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7060378Z context = 2025-05-07T20:31:46.7060784Z 2025-05-07T20:31:46.7061125Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7061946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7062604Z module_map=module_map) 2025-05-07T20:31:46.7063081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7063511Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7063867Z E ^ 2025-05-07T20:31:46.7064525Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7065167Z 2025-05-07T20:31:46.7065754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7066474Z 2025-05-07T20:31:46.7066620Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7067189Z self=, 2025-05-07T20:31:46.7067747Z T=1, 2025-05-07T20:31:46.7067996Z D=7168, 2025-05-07T20:31:46.7068270Z scale_ub=None, 2025-05-07T20:31:46.7068574Z contiguous=True, 2025-05-07T20:31:46.7068877Z compiled=True, 2025-05-07T20:31:46.7069163Z ) 2025-05-07T20:31:46.7069607Z self = 2025-05-07T20:31:46.7070263Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7070616Z 2025-05-07T20:31:46.7070719Z @given( 2025-05-07T20:31:46.7071027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7071440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7071850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7072294Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7072734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7073130Z ) 2025-05-07T20:31:46.7073610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7074224Z def test_silu_mul_quant( 2025-05-07T20:31:46.7074547Z self, 2025-05-07T20:31:46.7074819Z T: int, 2025-05-07T20:31:46.7075092Z D: int, 2025-05-07T20:31:46.7075385Z scale_ub: Optional[float], 2025-05-07T20:31:46.7075755Z contiguous: bool, 2025-05-07T20:31:46.7076085Z compiled: bool, 2025-05-07T20:31:46.7076379Z ) -> None: 2025-05-07T20:31:46.7076666Z torch.manual_seed(2025) 2025-05-07T20:31:46.7076992Z 2025-05-07T20:31:46.7077342Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7077793Z 2025-05-07T20:31:46.7078050Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7078433Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7078846Z x = x_sign * x_clamp 2025-05-07T20:31:46.7079168Z x0 = x[:, :D] 2025-05-07T20:31:46.7079452Z x1 = x[:, D:] 2025-05-07T20:31:46.7079740Z 2025-05-07T20:31:46.7080003Z if contiguous: 2025-05-07T20:31:46.7080332Z x0 = x0.contiguous() 2025-05-07T20:31:46.7080692Z x1 = x1.contiguous() 2025-05-07T20:31:46.7081035Z 2025-05-07T20:31:46.7081308Z if scale_ub is not None: 2025-05-07T20:31:46.7081691Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7082158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7082592Z ) 2025-05-07T20:31:46.7082864Z else: 2025-05-07T20:31:46.7083161Z scale_ub_tensor = None 2025-05-07T20:31:46.7083514Z 2025-05-07T20:31:46.7083833Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7084279Z op = silu_mul_quant 2025-05-07T20:31:46.7084638Z if compiled: 2025-05-07T20:31:46.7084984Z op = torch.compile(op) 2025-05-07T20:31:46.7085403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7085793Z 2025-05-07T20:31:46.7086174Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.7086571Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.7086989Z 2025-05-07T20:31:46.7087394Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7087838Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.7088231Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.7088651Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.7089123Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7089544Z 2025-05-07T20:31:46.7089818Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:46.7090294Z 2025-05-07T20:31:46.7090436Z moe/activation_test.py:126: 2025-05-07T20:31:46.7090829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7091289Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7091736Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7092821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7093857Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7094595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7095530Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7096461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7097446Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7098467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7099477Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7100545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7101425Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7102248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7102951Z fn() 2025-05-07T20:31:46.7103677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7104553Z self.fn.run( 2025-05-07T20:31:46.7105233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7105961Z kernel = self.compile( 2025-05-07T20:31:46.7106749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7107658Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7108205Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7108532Z 2025-05-07T20:31:46.7108809Z self = 2025-05-07T20:31:46.7110290Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7112208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1cb1f90160>} 2025-05-07T20:31:46.7114061Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7115669Z context = 2025-05-07T20:31:46.7116059Z 2025-05-07T20:31:46.7116427Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7117167Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7117801Z module_map=module_map) 2025-05-07T20:31:46.7118278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7118751Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7119114Z E ^ 2025-05-07T20:31:46.7119778Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7120445Z 2025-05-07T20:31:46.7121037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7121768Z 2025-05-07T20:31:46.7121909Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7122482Z self=, 2025-05-07T20:31:46.7123023Z T=4096, 2025-05-07T20:31:46.7123289Z D=5120, 2025-05-07T20:31:46.7123551Z scale_ub=None, 2025-05-07T20:31:46.7123838Z contiguous=False, 2025-05-07T20:31:46.7124151Z compiled=False, 2025-05-07T20:31:46.7124435Z ) 2025-05-07T20:31:46.7124857Z self = 2025-05-07T20:31:46.7125541Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.7125917Z 2025-05-07T20:31:46.7126032Z @given( 2025-05-07T20:31:46.7126338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7126769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7127189Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7127639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7128086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7128488Z ) 2025-05-07T20:31:46.7128978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7129571Z def test_silu_mul_quant( 2025-05-07T20:31:46.7129904Z self, 2025-05-07T20:31:46.7130168Z T: int, 2025-05-07T20:31:46.7130430Z D: int, 2025-05-07T20:31:46.7130733Z scale_ub: Optional[float], 2025-05-07T20:31:46.7131104Z contiguous: bool, 2025-05-07T20:31:46.7131432Z compiled: bool, 2025-05-07T20:31:46.7131742Z ) -> None: 2025-05-07T20:31:46.7132044Z torch.manual_seed(2025) 2025-05-07T20:31:46.7132367Z 2025-05-07T20:31:46.7132731Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7133189Z 2025-05-07T20:31:46.7133450Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7133835Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7134253Z x = x_sign * x_clamp 2025-05-07T20:31:46.7134602Z x0 = x[:, :D] 2025-05-07T20:31:46.7134902Z x1 = x[:, D:] 2025-05-07T20:31:46.7135198Z 2025-05-07T20:31:46.7135465Z if contiguous: 2025-05-07T20:31:46.7135787Z x0 = x0.contiguous() 2025-05-07T20:31:46.7136153Z x1 = x1.contiguous() 2025-05-07T20:31:46.7136533Z 2025-05-07T20:31:46.7136818Z if scale_ub is not None: 2025-05-07T20:31:46.7137215Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7137689Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7138123Z ) 2025-05-07T20:31:46.7138394Z else: 2025-05-07T20:31:46.7138692Z scale_ub_tensor = None 2025-05-07T20:31:46.7139046Z 2025-05-07T20:31:46.7139376Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7139928Z op = silu_mul_quant 2025-05-07T20:31:46.7140274Z if compiled: 
2025-05-07T20:31:46.7140616Z             op = torch.compile(op)
2025-05-07T20:31:46.7141145Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7141518Z 
2025-05-07T20:31:46.7141780Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7142098Z 
2025-05-07T20:31:46.7142228Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7142625Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:46.7143078Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7143480Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7144443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7145377Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7146095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:46.7147010Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:46.7147900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:46.7148648Z     kernel = self.compile(
2025-05-07T20:31:46.7158916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:46.7159818Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:46.7160370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:46.7160685Z 
2025-05-07T20:31:46.7160980Z self = 
2025-05-07T20:31:46.7162457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:46.7164344Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7f67ab90>}
2025-05-07T20:31:46.7166199Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:46.7167632Z context = 
2025-05-07T20:31:46.7168029Z 
2025-05-07T20:31:46.7168273Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:46.7169015Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:46.7169688Z                            module_map=module_map)
2025-05-07T20:31:46.7170209Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7170707Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7171066Z E   ^
2025-05-07T20:31:46.7171714Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7172339Z 
2025-05-07T20:31:46.7172928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7173640Z 
2025-05-07T20:31:46.7173782Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:46.7174345Z     self=,
2025-05-07T20:31:46.7174908Z     T=4096,
2025-05-07T20:31:46.7175174Z     D=7168,
2025-05-07T20:31:46.7175436Z     scale_ub=None,
2025-05-07T20:31:46.7175741Z     contiguous=False,
2025-05-07T20:31:46.7176058Z     compiled=False,
2025-05-07T20:31:46.7176344Z )
2025-05-07T20:31:46.7209763Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7210121Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7210382Z E   ^
2025-05-07T20:31:46.7210848Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7211703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7212225Z 
2025-05-07T20:31:46.7212331Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:46.7212744Z     self=,
2025-05-07T20:31:46.7213140Z     T=128,
2025-05-07T20:31:46.7213324Z     D=7168,
2025-05-07T20:31:46.7213522Z     scale_ub=None,
2025-05-07T20:31:46.7213741Z     contiguous=False,
2025-05-07T20:31:46.7213967Z     compiled=True,
2025-05-07T20:31:46.7214183Z )
2025-05-07T20:31:46.7249801Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7250155Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:31:46.7250422Z E   ^
2025-05-07T20:31:46.7250877Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7251735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7252247Z 
2025-05-07T20:31:46.7252352Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:46.7252767Z     self=,
2025-05-07T20:31:46.7253166Z     T=128,
2025-05-07T20:31:46.7253361Z     D=7168,
2025-05-07T20:31:46.7253565Z     scale_ub=None,
2025-05-07T20:31:46.7253779Z     contiguous=False,
2025-05-07T20:31:46.7254012Z     compiled=False,
2025-05-07T20:31:46.7254220Z )
2025-05-07T20:31:46.7280487Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7280844Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7281098Z E   ^
2025-05-07T20:31:46.7281567Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7282430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7283031Z 
2025-05-07T20:31:46.7283143Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:46.7283624Z     self=,
2025-05-07T20:31:46.7284030Z     T=4096,
2025-05-07T20:31:46.7284227Z     D=5120,
2025-05-07T20:31:46.7284417Z     scale_ub=1200.0,
2025-05-07T20:31:46.7284650Z     contiguous=True,
2025-05-07T20:31:46.7284874Z     compiled=False,
2025-05-07T20:31:46.7285087Z )
2025-05-07T20:31:46.7311823Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7312174Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7312436Z E   ^
2025-05-07T20:31:46.7312897Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7313764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7314273Z 
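Every failure in this run is the same compile-time error: Triton rejects the fp8e4nv (FP8 E4M3) dtype on this runner's GPU. A linux.g5.4xlarge instance carries an NVIDIA A10G (compute capability 8.6), and Triton's NVIDIA backend only accepts fp8e4nv from sm_89 (Ada) onward; on sm_86 it offers only 'fp8e4b15' and 'fp8e5', exactly as the ValueError reports. Cases with compiled=False fail inside fn() when _fbgemm_silu_mul_quant is compiled, while compiled=True cases reach the reference path and fail in _kernel_quantize_fp8_row instead. Below is a minimal sketch of a capability guard such a test could use to skip rather than error; the helper name, the (8, 9) threshold, and the unittest-based skip are illustrative assumptions, not code from this repository or this log.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumed guard: fp8e4nv (E4M3) kernels compile only on Ada (sm_89)
    # or newer GPUs; the A10G on a g5.4xlarge runner reports sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


class Fp8GuardExample(unittest.TestCase):
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    def test_fp8_kernel_path(self) -> None:
        # On an sm_86 runner this test would be reported as skipped instead
        # of raising triton.compiler.errors.CompilationError as seen above.
        pass


With a guard like this, a Hypothesis-driven test such as test_silu_mul_quant would surface one skip per unsupported device rather than one CompilationError per drawn example.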
y_scale_ref = ref_fn() 2025-05-07T20:31:46.7331150Z 2025-05-07T20:31:46.7331260Z moe/activation_test.py:126: 2025-05-07T20:31:46.7331398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7331507Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7331651Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7332225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7332331Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7332703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7332928Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7333319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7333580Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7333983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7334243Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7334617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7334800Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7335148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7335228Z fn() 2025-05-07T20:31:46.7335733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7335893Z self.fn.run( 2025-05-07T20:31:46.7336237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7336341Z kernel = self.compile( 2025-05-07T20:31:46.7336722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7336908Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7337039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7337044Z 2025-05-07T20:31:46.7337253Z self = 2025-05-07T20:31:46.7338035Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7338555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1c7e5bea70>} 2025-05-07T20:31:46.7339308Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7339503Z context = 2025-05-07T20:31:46.7339508Z 2025-05-07T20:31:46.7339677Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7340057Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7340168Z module_map=module_map) 2025-05-07T20:31:46.7340343Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7340453Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7340534Z E ^ 2025-05-07T20:31:46.7340901Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7340906Z 2025-05-07T20:31:46.7341326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7341331Z 2025-05-07T20:31:46.7341443Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7341667Z self=, 2025-05-07T20:31:46.7341747Z T=2048, 2025-05-07T20:31:46.7341831Z D=5120, 2025-05-07T20:31:46.7341918Z scale_ub=None, 2025-05-07T20:31:46.7342005Z contiguous=True, 2025-05-07T20:31:46.7342095Z compiled=True, 2025-05-07T20:31:46.7342174Z ) 2025-05-07T20:31:46.7342391Z self = 2025-05-07T20:31:46.7342574Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7342587Z 2025-05-07T20:31:46.7342667Z @given( 2025-05-07T20:31:46.7342799Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7342901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7343018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7343144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7343263Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7343343Z ) 2025-05-07T20:31:46.7343599Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7343697Z def test_silu_mul_quant( 2025-05-07T20:31:46.7343777Z self, 2025-05-07T20:31:46.7343869Z T: int, 2025-05-07T20:31:46.7343949Z D: int, 2025-05-07T20:31:46.7344051Z scale_ub: Optional[float], 2025-05-07T20:31:46.7344246Z contiguous: bool, 2025-05-07T20:31:46.7344336Z compiled: bool, 2025-05-07T20:31:46.7344427Z ) -> None: 2025-05-07T20:31:46.7344602Z torch.manual_seed(2025) 2025-05-07T20:31:46.7344682Z 2025-05-07T20:31:46.7344867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7344948Z 2025-05-07T20:31:46.7345047Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7345183Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7345276Z x = x_sign * x_clamp 2025-05-07T20:31:46.7345360Z x0 = x[:, :D] 2025-05-07T20:31:46.7345452Z x1 = x[:, D:] 2025-05-07T20:31:46.7345529Z 2025-05-07T20:31:46.7345616Z if contiguous: 2025-05-07T20:31:46.7345723Z x0 = x0.contiguous() 2025-05-07T20:31:46.7345819Z x1 = x1.contiguous() 2025-05-07T20:31:46.7345903Z 2025-05-07T20:31:46.7345999Z if scale_ub is not None: 2025-05-07T20:31:46.7346108Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7346265Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7346346Z ) 2025-05-07T20:31:46.7346432Z else: 2025-05-07T20:31:46.7346539Z scale_ub_tensor = None 2025-05-07T20:31:46.7346616Z 2025-05-07T20:31:46.7346750Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7346850Z op = silu_mul_quant 2025-05-07T20:31:46.7346940Z if compiled: 
2025-05-07T20:31:46.7347042Z op = torch.compile(op) 2025-05-07T20:31:46.7347158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7347236Z 2025-05-07T20:31:46.7347338Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.7347465Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.7347540Z 2025-05-07T20:31:46.7347686Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7347790Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.7347897Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.7348029Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.7348172Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7348250Z 2025-05-07T20:31:46.7348353Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.7348358Z 2025-05-07T20:31:46.7348470Z moe/activation_test.py:126: 2025-05-07T20:31:46.7348599Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7348707Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7348849Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7349406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7349517Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7349878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7350114Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7350485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7350742Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7351143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7351394Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7351767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7351939Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7352280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7352448Z fn() 2025-05-07T20:31:46.7352980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7353064Z self.fn.run( 2025-05-07T20:31:46.7353412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7353510Z kernel = self.compile( 2025-05-07T20:31:46.7353894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7354077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7354206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7354210Z 2025-05-07T20:31:46.7354422Z self = 2025-05-07T20:31:46.7355200Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:46.7355712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c6c9f8dc0>} 2025-05-07T20:31:46.7356461Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7356654Z context = 2025-05-07T20:31:46.7356659Z 2025-05-07T20:31:46.7356832Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7357099Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7357212Z module_map=module_map) 2025-05-07T20:31:46.7357387Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7357492Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7357571Z E ^ 2025-05-07T20:31:46.7357935Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7357940Z 2025-05-07T20:31:46.7358360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7358365Z 2025-05-07T20:31:46.7358477Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7358701Z self=, 2025-05-07T20:31:46.7358782Z T=128, 2025-05-07T20:31:46.7358867Z D=5120, 2025-05-07T20:31:46.7358952Z scale_ub=None, 2025-05-07T20:31:46.7359045Z contiguous=True, 2025-05-07T20:31:46.7359130Z compiled=True, 2025-05-07T20:31:46.7359212Z ) 2025-05-07T20:31:46.7359436Z self = 2025-05-07T20:31:46.7359614Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7359619Z 2025-05-07T20:31:46.7359698Z @given( 2025-05-07T20:31:46.7359826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7359932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7360049Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7360174Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7360291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7360375Z ) 2025-05-07T20:31:46.7360629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7360726Z def test_silu_mul_quant( 2025-05-07T20:31:46.7360811Z self, 2025-05-07T20:31:46.7360891Z T: int, 2025-05-07T20:31:46.7361058Z D: int, 2025-05-07T20:31:46.7361166Z scale_ub: Optional[float], 2025-05-07T20:31:46.7361259Z contiguous: bool, 2025-05-07T20:31:46.7361422Z compiled: bool, 2025-05-07T20:31:46.7361511Z ) -> None: 2025-05-07T20:31:46.7361609Z torch.manual_seed(2025) 2025-05-07T20:31:46.7361684Z 2025-05-07T20:31:46.7361858Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7361933Z 2025-05-07T20:31:46.7362033Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7362160Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7362251Z x = x_sign * x_clamp 2025-05-07T20:31:46.7362341Z x0 = x[:, :D] 2025-05-07T20:31:46.7362423Z x1 = x[:, D:] 2025-05-07T20:31:46.7362498Z 2025-05-07T20:31:46.7362589Z if contiguous: 2025-05-07T20:31:46.7362684Z x0 = x0.contiguous() 2025-05-07T20:31:46.7362775Z x1 = x1.contiguous() 2025-05-07T20:31:46.7362856Z 2025-05-07T20:31:46.7362956Z if scale_ub is not None: 2025-05-07T20:31:46.7363062Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7363211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7363291Z ) 2025-05-07T20:31:46.7363371Z else: 2025-05-07T20:31:46.7363474Z scale_ub_tensor = None 2025-05-07T20:31:46.7363549Z 2025-05-07T20:31:46.7363687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:46.7363779Z op = silu_mul_quant 2025-05-07T20:31:46.7363867Z if compiled: 2025-05-07T20:31:46.7363976Z op = torch.compile(op) 2025-05-07T20:31:46.7364083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7364158Z 2025-05-07T20:31:46.7364259Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.7364383Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.7364458Z 2025-05-07T20:31:46.7364603Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7364713Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.7364818Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.7364948Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.7365089Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7365171Z 2025-05-07T20:31:46.7365272Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.7365277Z 2025-05-07T20:31:46.7365376Z moe/activation_test.py:126: 2025-05-07T20:31:46.7365512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7365619Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7365756Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7366320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7366424Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7366805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7367032Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7367399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7367660Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7368063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7368321Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7368693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7368860Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7369372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7369455Z fn() 2025-05-07T20:31:46.7369859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7369950Z self.fn.run( 2025-05-07T20:31:46.7370292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7370396Z kernel = self.compile( 2025-05-07T20:31:46.7370783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7370960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7371095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7371100Z 2025-05-07T20:31:46.7371304Z self = 2025-05-07T20:31:46.7372098Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7372608Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c564dcb80>} 2025-05-07T20:31:46.7373346Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7373542Z context = 2025-05-07T20:31:46.7373547Z 2025-05-07T20:31:46.7373712Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7373990Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7374104Z module_map=module_map) 2025-05-07T20:31:46.7374269Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7374378Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7374458Z E ^ 2025-05-07T20:31:46.7374819Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7374824Z 2025-05-07T20:31:46.7375237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7375242Z 2025-05-07T20:31:46.7375347Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7375577Z self=, 2025-05-07T20:31:46.7375657Z T=4096, 2025-05-07T20:31:46.7375735Z D=5120, 2025-05-07T20:31:46.7375834Z scale_ub=None, 2025-05-07T20:31:46.7375922Z contiguous=True, 2025-05-07T20:31:46.7376017Z compiled=True, 2025-05-07T20:31:46.7376093Z ) 2025-05-07T20:31:46.7376315Z self = 2025-05-07T20:31:46.7376494Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7376499Z 2025-05-07T20:31:46.7376577Z @given( 2025-05-07T20:31:46.7376699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7376806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7376923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7377042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7377164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7377241Z ) 2025-05-07T20:31:46.7377497Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7377592Z def test_silu_mul_quant( 2025-05-07T20:31:46.7377757Z self, 2025-05-07T20:31:46.7377843Z T: int, 2025-05-07T20:31:46.7377921Z D: int, 2025-05-07T20:31:46.7378102Z scale_ub: Optional[float], 2025-05-07T20:31:46.7378203Z contiguous: bool, 2025-05-07T20:31:46.7378291Z compiled: bool, 2025-05-07T20:31:46.7378372Z ) -> None: 2025-05-07T20:31:46.7378478Z torch.manual_seed(2025) 2025-05-07T20:31:46.7378553Z 2025-05-07T20:31:46.7378723Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7378808Z 2025-05-07T20:31:46.7378901Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7379032Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7379124Z x = x_sign * x_clamp 2025-05-07T20:31:46.7379206Z x0 = x[:, :D] 2025-05-07T20:31:46.7379298Z x1 = x[:, D:] 2025-05-07T20:31:46.7379372Z 2025-05-07T20:31:46.7379458Z if contiguous: 2025-05-07T20:31:46.7379559Z x0 = x0.contiguous() 2025-05-07T20:31:46.7379655Z x1 = x1.contiguous() 2025-05-07T20:31:46.7379729Z 2025-05-07T20:31:46.7379903Z if scale_ub is not None: 2025-05-07T20:31:46.7380011Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7380150Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7380234Z ) 2025-05-07T20:31:46.7380313Z else: 2025-05-07T20:31:46.7380411Z scale_ub_tensor 
= None 2025-05-07T20:31:46.7380491Z 2025-05-07T20:31:46.7380623Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7380723Z op = silu_mul_quant 2025-05-07T20:31:46.7380810Z if compiled: 2025-05-07T20:31:46.7380916Z op = torch.compile(op) 2025-05-07T20:31:46.7381030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7381104Z 2025-05-07T20:31:46.7381198Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.7381327Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.7381408Z 2025-05-07T20:31:46.7381546Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7381659Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.7381761Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.7381885Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.7382032Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7382108Z 2025-05-07T20:31:46.7382216Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.7382220Z 2025-05-07T20:31:46.7382321Z moe/activation_test.py:126: 2025-05-07T20:31:46.7382450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7382564Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7382698Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7383257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7383373Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7383739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7383967Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7384340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7384601Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7385005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7385258Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7385637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7385923Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7386350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7386440Z fn() 2025-05-07T20:31:46.7386839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7386927Z self.fn.run( 2025-05-07T20:31:46.7387270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7387366Z kernel = self.compile( 2025-05-07T20:31:46.7387753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7387929Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7388057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7388068Z 2025-05-07T20:31:46.7388288Z self = 2025-05-07T20:31:46.7389061Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7389566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c5658c0d0>} 2025-05-07T20:31:46.7390683Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7390888Z context = 2025-05-07T20:31:46.7390900Z 2025-05-07T20:31:46.7391069Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7391342Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7391456Z module_map=module_map) 2025-05-07T20:31:46.7391623Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7391728Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7391815Z E ^ 2025-05-07T20:31:46.7392172Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7392177Z 2025-05-07T20:31:46.7392604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7392609Z 2025-05-07T20:31:46.7392715Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7392939Z self=, 2025-05-07T20:31:46.7393032Z T=16384, 2025-05-07T20:31:46.7393111Z D=5120, 2025-05-07T20:31:46.7393197Z scale_ub=None, 2025-05-07T20:31:46.7393295Z contiguous=True, 2025-05-07T20:31:46.7393384Z compiled=True, 2025-05-07T20:31:46.7393461Z ) 2025-05-07T20:31:46.7393687Z self = 2025-05-07T20:31:46.7393862Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7393867Z 2025-05-07T20:31:46.7393954Z @given( 2025-05-07T20:31:46.7394076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7394180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7394303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7394422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7394539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7394625Z ) 2025-05-07T20:31:46.7394872Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7395219Z def test_silu_mul_quant( 2025-05-07T20:31:46.7395406Z self, 2025-05-07T20:31:46.7395487Z T: int, 2025-05-07T20:31:46.7395573Z D: int, 2025-05-07T20:31:46.7395674Z scale_ub: Optional[float], 2025-05-07T20:31:46.7395767Z contiguous: bool, 2025-05-07T20:31:46.7395868Z compiled: bool, 2025-05-07T20:31:46.7395948Z ) -> None: 2025-05-07T20:31:46.7396045Z torch.manual_seed(2025) 2025-05-07T20:31:46.7396126Z 2025-05-07T20:31:46.7396299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7396375Z 2025-05-07T20:31:46.7396474Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7396601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7396699Z x = x_sign * x_clamp 2025-05-07T20:31:46.7396783Z x0 = x[:, :D] 2025-05-07T20:31:46.7396864Z x1 = x[:, D:] 2025-05-07T20:31:46.7396949Z 2025-05-07T20:31:46.7397035Z if contiguous: 2025-05-07T20:31:46.7397129Z x0 = x0.contiguous() 2025-05-07T20:31:46.7397232Z x1 = x1.contiguous() 2025-05-07T20:31:46.7397307Z 2025-05-07T20:31:46.7397400Z if scale_ub is not None: 2025-05-07T20:31:46.7397512Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7397651Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:31:46.7397729Z ) 2025-05-07T20:31:46.7397815Z else: 2025-05-07T20:31:46.7397912Z scale_ub_tensor = None 2025-05-07T20:31:46.7397988Z 2025-05-07T20:31:46.7398125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7398217Z op = silu_mul_quant 2025-05-07T20:31:46.7398311Z if compiled: 2025-05-07T20:31:46.7398413Z op = torch.compile(op) 2025-05-07T20:31:46.7398522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7398603Z 2025-05-07T20:31:46.7398704Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.7398829Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.7398913Z 2025-05-07T20:31:46.7399052Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7399156Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.7399263Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.7399387Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.7399534Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7399610Z 2025-05-07T20:31:46.7399711Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.7399715Z 2025-05-07T20:31:46.7399819Z moe/activation_test.py:126: 2025-05-07T20:31:46.7399950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7400056Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7400198Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7400767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7400878Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7401244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7401465Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7401843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7402101Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7402503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7402762Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7403388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7403564Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7403906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7403985Z fn() 2025-05-07T20:31:46.7404397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7404483Z self.fn.run( 2025-05-07T20:31:46.7404828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7404929Z kernel = self.compile( 2025-05-07T20:31:46.7405315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7405499Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7405639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:46.7405643Z 
2025-05-07T20:31:46.7405854Z self = 
2025-05-07T20:31:46.7406682Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:46.7407185Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c5663a3b0>}
2025-05-07T20:31:46.7407933Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:46.7408131Z context = 
2025-05-07T20:31:46.7408136Z 
2025-05-07T20:31:46.7408313Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:46.7408580Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:31:46.7408856Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7408963Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:46.7409042Z E       ^
2025-05-07T20:31:46.7409406Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7409411Z 
2025-05-07T20:31:46.7409824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7409829Z 
2025-05-07T20:31:46.7409940Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:46.7415726Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7415832Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7416079Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7416183Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7416558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:46.7416656Z     return fn(*args, **kwargs)
2025-05-07T20:31:46.7417166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7417268Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7418707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:46.7418885Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:46.7422437Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7422552Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7422631Z E       ^
2025-05-07T20:31:46.7422995Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7423000Z 
2025-05-07T20:31:46.7423420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
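Both tracebacks above fail identically: Triton refuses to emit fp8e4nv (its name for torch.float8_e4m3fn) because the runner GPU, the A10G on linux.g5.4xlarge, has compute capability 8.6, while Triton only lowers fp8e4nv conversions on compute capability 8.9 (Ada) and above; pre-Ada parts get only fp8e4b15 and fp8e5, exactly as the error lists. A minimal guard sketch in that spirit, assuming a hypothetical helper name gpu_supports_fp8e4nv (not part of the test file):

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (Ada) or SM 9.0 (Hopper);
    # SM 8.6 parts such as the A10G only support fp8e4b15 and fp8e5.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, the whole Hypothesis search would be skipped on
# this runner instead of erroring on every drawn example, e.g.:
#
#   @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
#   def test_silu_mul_quant(self, ...) -> None: ...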
2025-05-07T20:31:46.7423424Z 
2025-05-07T20:31:46.7423531Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:46.7430317Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:46.7430431Z moe/activation_test.py:126: 
2025-05-07T20:31:46.7430673Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:46.7430808Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:46.7431367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:46.7431477Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:46.7439309Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7439418Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:46.7439505Z E       ^
2025-05-07T20:31:46.7439859Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7440276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7440287Z 
2025-05-07T20:31:46.7440393Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:46.7446182Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7446287Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7446534Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7446634Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7447152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7447257Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7452157Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7452272Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7452353Z E       ^
2025-05-07T20:31:46.7452713Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7453136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7453141Z 
2025-05-07T20:31:46.7453246Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:46.7464969Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7465092Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7465338Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7465442Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7465816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:46.7465923Z     return fn(*args, **kwargs)
2025-05-07T20:31:46.7466417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7466526Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7471653Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7471760Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7471848Z E       ^
2025-05-07T20:31:46.7472205Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7472627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7472632Z 
2025-05-07T20:31:46.7472744Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:46.7478619Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7478724Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7478970Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7479072Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7479594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7479697Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7484795Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7484898Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7484982Z E       ^
2025-05-07T20:31:46.7485338Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7485756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
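The ref_fn path that fails in _kernel_quantize_fp8_row is row-wise fp8 quantization of y = x0 * sigmoid(x0) * x1. A plain-PyTorch sketch of that rescaling, inferred only from how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]); the scale_ub clamping detail is an assumption, and quantize_fp8_row_ref is a hypothetical name, not the FBGEMM implementation:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # The per-row max magnitude decides how much each row must be rescaled
    # to fit into the representable fp8 range.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX  # dequant scale per row
    y_fp8 = (y.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale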
2025-05-07T20:31:46.7485768Z 
2025-05-07T20:31:46.7485873Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:46.7491989Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7492321Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7492681Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7492784Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7493296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7493397Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7498276Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7498384Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7498464Z E       ^
2025-05-07T20:31:46.7498822Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7499237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7499246Z 
2025-05-07T20:31:46.7499359Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:31:46.7505142Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7505251Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7505486Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7505587Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7506087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7506192Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7511202Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7511308Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7511393Z E       ^
2025-05-07T20:31:46.7511746Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7512171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7512176Z 
2025-05-07T20:31:46.7512281Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:46.7518012Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7518118Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7518357Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7518455Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7518831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:46.7518933Z     return fn(*args, **kwargs)
2025-05-07T20:31:46.7519426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7519526Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7524367Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7524471Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7524549Z E       ^
2025-05-07T20:31:46.7524899Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7525321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7525326Z 
2025-05-07T20:31:46.7525515Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:46.7531067Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7531173Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7531400Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7531507Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7531870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:46.7531966Z     return fn(*args, **kwargs)
2025-05-07T20:31:46.7532463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7532563Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7537693Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7537795Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7537876Z E       ^
2025-05-07T20:31:46.7538233Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7538649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
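Every _fbgemm_silu_mul_quant failure sits in the fused activation path, whose math the test's ref_fn spells out: y = x0 * sigmoid(x0) * x1 in fp32, i.e. SiLU(x0) * x1, followed by the row-wise quantization sketched earlier. An eager-mode equivalence check of just the activation part, runnable on CPU (silu_mul_ref is a hypothetical name):

import torch
import torch.nn.functional as F

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x) = x * sigmoid(x), so this matches ref_fn's
    # x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32.
    return F.silu(x0.float()) * x1.float()

x = torch.randn([4, 2 * 8], dtype=torch.bfloat16)
x0, x1 = x[:, :8], x[:, 8:]
assert torch.allclose(
    silu_mul_ref(x0, x1),
    x0.float() * torch.sigmoid(x0.float()) * x1.float(),
)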
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f1c55826440>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    [test source identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
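All of these failures share one root cause: Triton's fp8e4nv (e4m3) dtype requires an NVIDIA GPU with compute capability >= 8.9 (Ada or Hopper), and the A10G backing this g5 runner is sm_86, so any kernel that materializes an fp8e4nv value fails to compile before the test logic ever runs. A minimal sketch of a capability guard for such tests, assuming a plain unittest-style test class (the helper and class names here are illustrative, not FBGEMM's actual API):

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv to e4m3 hardware conversions, which first
    # appeared on Ada (sm_89) and Hopper (sm_90) GPUs.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs sm_89 or newer")
class SiluMulQuantTests(unittest.TestCase):
    ...  # test_silu_mul_quant would live here

With a guard like this the job would report these cases as skipped instead of burning through every Hypothesis example and failing the suite.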
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7569307Z 2025-05-07T20:31:46.7569729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7569740Z 2025-05-07T20:31:46.7569857Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7570080Z self=, 2025-05-07T20:31:46.7570160Z T=1, 2025-05-07T20:31:46.7570247Z D=5120, 2025-05-07T20:31:46.7570333Z scale_ub=1200.0, 2025-05-07T20:31:46.7570428Z contiguous=False, 2025-05-07T20:31:46.7570515Z compiled=False, 2025-05-07T20:31:46.7570593Z ) 2025-05-07T20:31:46.7570818Z self = 2025-05-07T20:31:46.7570989Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.7570993Z 2025-05-07T20:31:46.7571074Z @given( 2025-05-07T20:31:46.7571202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7571304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7571426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7571557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7571675Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7571760Z ) 2025-05-07T20:31:46.7572011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7572108Z def test_silu_mul_quant( 2025-05-07T20:31:46.7572194Z self, 2025-05-07T20:31:46.7572273Z T: int, 2025-05-07T20:31:46.7572353Z D: int, 2025-05-07T20:31:46.7572462Z scale_ub: Optional[float], 2025-05-07T20:31:46.7572557Z contiguous: bool, 2025-05-07T20:31:46.7572647Z compiled: bool, 2025-05-07T20:31:46.7572734Z ) -> None: 2025-05-07T20:31:46.7572831Z torch.manual_seed(2025) 2025-05-07T20:31:46.7572907Z 2025-05-07T20:31:46.7573082Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7573166Z 2025-05-07T20:31:46.7573269Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7573395Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7573492Z x = x_sign * x_clamp 2025-05-07T20:31:46.7573584Z x0 = x[:, :D] 2025-05-07T20:31:46.7573668Z x1 = x[:, D:] 2025-05-07T20:31:46.7573745Z 2025-05-07T20:31:46.7573837Z if contiguous: 2025-05-07T20:31:46.7573932Z x0 = x0.contiguous() 2025-05-07T20:31:46.7574026Z x1 = x1.contiguous() 2025-05-07T20:31:46.7574110Z 2025-05-07T20:31:46.7574203Z if scale_ub is not None: 2025-05-07T20:31:46.7574310Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7574453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7574531Z ) 2025-05-07T20:31:46.7574610Z else: 2025-05-07T20:31:46.7574718Z scale_ub_tensor = None 2025-05-07T20:31:46.7574795Z 2025-05-07T20:31:46.7574933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7575125Z op = silu_mul_quant 2025-05-07T20:31:46.7575216Z if compiled: 2025-05-07T20:31:46.7575435Z op = torch.compile(op) 2025-05-07T20:31:46.7575549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7575626Z 2025-05-07T20:31:46.7575726Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7575730Z 2025-05-07T20:31:46.7575830Z moe/activation_test.py:117: 2025-05-07T20:31:46.7575962Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7576071Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7576172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7576678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7576780Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7577140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7577381Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7577726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7577825Z kernel = self.compile( 2025-05-07T20:31:46.7578214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7578388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7578527Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7578531Z 2025-05-07T20:31:46.7578734Z self = 2025-05-07T20:31:46.7579502Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7580132Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55075e10>} 2025-05-07T20:31:46.7580876Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7581075Z context = 2025-05-07T20:31:46.7581079Z 2025-05-07T20:31:46.7581245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7581514Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7581624Z module_map=module_map) 2025-05-07T20:31:46.7581789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7581904Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7581986Z E ^ 2025-05-07T20:31:46.7582344Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7582349Z 2025-05-07T20:31:46.7582767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7582772Z 2025-05-07T20:31:46.7582878Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7583110Z self=, 2025-05-07T20:31:46.7583190Z T=16384, 2025-05-07T20:31:46.7583294Z D=5120, 2025-05-07T20:31:46.7583380Z scale_ub=1200.0, 2025-05-07T20:31:46.7583471Z contiguous=False, 2025-05-07T20:31:46.7583566Z compiled=True, 2025-05-07T20:31:46.7583645Z ) 2025-05-07T20:31:46.7589610Z self = 2025-05-07T20:31:46.7590323Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.7590572Z 2025-05-07T20:31:46.7590666Z @given( 2025-05-07T20:31:46.7590798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7590910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7591033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7591152Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7591279Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7591359Z ) 2025-05-07T20:31:46.7591624Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7591728Z def test_silu_mul_quant( 2025-05-07T20:31:46.7591815Z self, 2025-05-07T20:31:46.7591906Z T: int, 2025-05-07T20:31:46.7591987Z D: int, 2025-05-07T20:31:46.7592091Z scale_ub: Optional[float], 2025-05-07T20:31:46.7592202Z contiguous: bool, 2025-05-07T20:31:46.7592292Z compiled: bool, 2025-05-07T20:31:46.7592380Z ) -> None: 2025-05-07T20:31:46.7592492Z torch.manual_seed(2025) 2025-05-07T20:31:46.7592572Z 2025-05-07T20:31:46.7592747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7592833Z 2025-05-07T20:31:46.7592931Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7593069Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7593163Z x = x_sign * x_clamp 2025-05-07T20:31:46.7593249Z x0 = x[:, :D] 2025-05-07T20:31:46.7593342Z x1 = x[:, D:] 2025-05-07T20:31:46.7593422Z 2025-05-07T20:31:46.7593513Z if contiguous: 2025-05-07T20:31:46.7593620Z x0 = x0.contiguous() 2025-05-07T20:31:46.7593714Z x1 = x1.contiguous() 2025-05-07T20:31:46.7593792Z 2025-05-07T20:31:46.7593899Z if scale_ub is not None: 2025-05-07T20:31:46.7594008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7594152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7594243Z ) 2025-05-07T20:31:46.7594330Z else: 2025-05-07T20:31:46.7594437Z scale_ub_tensor = None 2025-05-07T20:31:46.7594516Z 2025-05-07T20:31:46.7594648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7594750Z op = silu_mul_quant 2025-05-07T20:31:46.7594842Z if compiled: 2025-05-07T20:31:46.7594946Z op = torch.compile(op) 2025-05-07T20:31:46.7595064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7595142Z 2025-05-07T20:31:46.7595237Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7595242Z 2025-05-07T20:31:46.7595352Z moe/activation_test.py:117: 2025-05-07T20:31:46.7595482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7595594Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7595702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7596079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7596187Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7596683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7596788Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7597154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7597379Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7597728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7597828Z kernel = self.compile( 2025-05-07T20:31:46.7598213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7598540Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7598745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7598751Z 2025-05-07T20:31:46.7598962Z self = 2025-05-07T20:31:46.7599745Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7600247Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c550743a0>} 2025-05-07T20:31:46.7601001Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7601201Z context = 2025-05-07T20:31:46.7601206Z 2025-05-07T20:31:46.7601384Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7601649Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7601762Z module_map=module_map) 2025-05-07T20:31:46.7601934Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7602037Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7602118Z E ^ 2025-05-07T20:31:46.7602487Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7602492Z 2025-05-07T20:31:46.7602913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7602921Z 2025-05-07T20:31:46.7603037Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7603265Z self=, 2025-05-07T20:31:46.7603349Z T=2048, 2025-05-07T20:31:46.7603438Z D=7168, 2025-05-07T20:31:46.7603525Z scale_ub=1200.0, 2025-05-07T20:31:46.7603613Z contiguous=False, 2025-05-07T20:31:46.7603710Z compiled=True, 2025-05-07T20:31:46.7603791Z ) 2025-05-07T20:31:46.7604017Z self = 2025-05-07T20:31:46.7604194Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.7604198Z 2025-05-07T20:31:46.7604278Z @given( 2025-05-07T20:31:46.7604410Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7604516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7604633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7604764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7604884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7604969Z ) 2025-05-07T20:31:46.7605228Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7605326Z def test_silu_mul_quant( 2025-05-07T20:31:46.7605417Z self, 2025-05-07T20:31:46.7605497Z T: int, 2025-05-07T20:31:46.7605578Z D: int, 2025-05-07T20:31:46.7605690Z scale_ub: Optional[float], 2025-05-07T20:31:46.7605786Z contiguous: bool, 2025-05-07T20:31:46.7605876Z compiled: bool, 2025-05-07T20:31:46.7605971Z ) -> None: 2025-05-07T20:31:46.7606071Z torch.manual_seed(2025) 2025-05-07T20:31:46.7606152Z 2025-05-07T20:31:46.7606335Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7606414Z 2025-05-07T20:31:46.7606515Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7606676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7606881Z x = x_sign * x_clamp 2025-05-07T20:31:46.7606976Z x0 = x[:, :D] 2025-05-07T20:31:46.7607144Z x1 = x[:, D:] 2025-05-07T20:31:46.7607225Z 2025-05-07T20:31:46.7607323Z if contiguous: 2025-05-07T20:31:46.7607420Z x0 = x0.contiguous() 2025-05-07T20:31:46.7607513Z x1 = x1.contiguous() 2025-05-07T20:31:46.7607601Z 2025-05-07T20:31:46.7607700Z if scale_ub is not None: 2025-05-07T20:31:46.7607812Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7607960Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7608040Z ) 2025-05-07T20:31:46.7608122Z else: 2025-05-07T20:31:46.7608234Z scale_ub_tensor = None 2025-05-07T20:31:46.7608312Z 2025-05-07T20:31:46.7608455Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7608550Z op = silu_mul_quant 2025-05-07T20:31:46.7608648Z if compiled: 2025-05-07T20:31:46.7608766Z op = torch.compile(op) 2025-05-07T20:31:46.7608881Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7608960Z 2025-05-07T20:31:46.7609064Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7609069Z 2025-05-07T20:31:46.7609173Z moe/activation_test.py:117: 2025-05-07T20:31:46.7609304Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7609417Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7609522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7609898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7609996Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7610492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7610601Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7610969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7611204Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7611555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7611666Z kernel = self.compile( 2025-05-07T20:31:46.7612055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7612242Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7612373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7612377Z 2025-05-07T20:31:46.7612584Z self = 2025-05-07T20:31:46.7613368Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7613874Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55075fc0>} 2025-05-07T20:31:46.7614622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7614815Z context = 2025-05-07T20:31:46.7614820Z 2025-05-07T20:31:46.7614987Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7615258Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7615367Z module_map=module_map) 2025-05-07T20:31:46.7615634Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7615849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7615931Z E ^ 2025-05-07T20:31:46.7616291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7616296Z 2025-05-07T20:31:46.7616711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7616715Z 2025-05-07T20:31:46.7616826Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7617054Z self=, 2025-05-07T20:31:46.7617134Z T=1, 2025-05-07T20:31:46.7617219Z D=5120, 2025-05-07T20:31:46.7617306Z scale_ub=None, 2025-05-07T20:31:46.7617398Z contiguous=False, 2025-05-07T20:31:46.7617495Z compiled=False, 2025-05-07T20:31:46.7617580Z ) 2025-05-07T20:31:46.7617798Z self = 2025-05-07T20:31:46.7617980Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.7617985Z 2025-05-07T20:31:46.7618065Z @given( 2025-05-07T20:31:46.7618194Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7618295Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7618414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7618541Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7618657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7618737Z ) 2025-05-07T20:31:46.7618993Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7619093Z def test_silu_mul_quant( 2025-05-07T20:31:46.7619177Z self, 2025-05-07T20:31:46.7619265Z T: int, 2025-05-07T20:31:46.7619345Z D: int, 2025-05-07T20:31:46.7619454Z scale_ub: Optional[float], 2025-05-07T20:31:46.7619556Z contiguous: bool, 2025-05-07T20:31:46.7619648Z compiled: bool, 2025-05-07T20:31:46.7619743Z ) -> None: 2025-05-07T20:31:46.7619959Z torch.manual_seed(2025) 2025-05-07T20:31:46.7620041Z 2025-05-07T20:31:46.7620220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7620298Z 2025-05-07T20:31:46.7620394Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7620529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7620623Z x = x_sign * x_clamp 2025-05-07T20:31:46.7620708Z x0 = x[:, :D] 2025-05-07T20:31:46.7620802Z x1 = x[:, D:] 2025-05-07T20:31:46.7620879Z 2025-05-07T20:31:46.7620968Z if contiguous: 2025-05-07T20:31:46.7621072Z x0 = x0.contiguous() 2025-05-07T20:31:46.7621164Z x1 = x1.contiguous() 2025-05-07T20:31:46.7621240Z 2025-05-07T20:31:46.7621343Z if scale_ub is not None: 2025-05-07T20:31:46.7621456Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7621608Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7621688Z ) 2025-05-07T20:31:46.7621769Z else: 2025-05-07T20:31:46.7621873Z scale_ub_tensor = None 2025-05-07T20:31:46.7621950Z 2025-05-07T20:31:46.7622081Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7622182Z op = silu_mul_quant 2025-05-07T20:31:46.7622271Z if compiled: 2025-05-07T20:31:46.7622378Z op = torch.compile(op) 2025-05-07T20:31:46.7622493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7622571Z 2025-05-07T20:31:46.7622664Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7622675Z 2025-05-07T20:31:46.7622778Z moe/activation_test.py:117: 2025-05-07T20:31:46.7622906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7623109Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7623211Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7623783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7623890Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7624247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7624475Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7624816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7624913Z kernel = self.compile( 2025-05-07T20:31:46.7625309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7625485Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7625618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7625623Z 2025-05-07T20:31:46.7625842Z self = 2025-05-07T20:31:46.7626616Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7627131Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55077490>} 2025-05-07T20:31:46.7627872Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7628070Z context = 2025-05-07T20:31:46.7628078Z 2025-05-07T20:31:46.7628250Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7628511Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7628628Z module_map=module_map) 2025-05-07T20:31:46.7628794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7628895Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7628981Z E ^ 2025-05-07T20:31:46.7629335Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7629340Z 2025-05-07T20:31:46.7629759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7629764Z 2025-05-07T20:31:46.7629872Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7630098Z self=, 2025-05-07T20:31:46.7630186Z T=4096, 2025-05-07T20:31:46.7630270Z D=7168, 2025-05-07T20:31:46.7630358Z scale_ub=1200.0, 2025-05-07T20:31:46.7630456Z contiguous=False, 2025-05-07T20:31:46.7630546Z compiled=False, 2025-05-07T20:31:46.7630628Z ) 2025-05-07T20:31:46.7630845Z self = 2025-05-07T20:31:46.7631024Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.7631028Z 2025-05-07T20:31:46.7631113Z @given( 2025-05-07T20:31:46.7631233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7631334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7631455Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7631572Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7631693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7631855Z ) 2025-05-07T20:31:46.7632104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7632281Z def test_silu_mul_quant( 2025-05-07T20:31:46.7632363Z self, 2025-05-07T20:31:46.7632442Z T: int, 2025-05-07T20:31:46.7632526Z D: int, 2025-05-07T20:31:46.7632628Z scale_ub: Optional[float], 2025-05-07T20:31:46.7632723Z contiguous: bool, 2025-05-07T20:31:46.7632819Z compiled: bool, 2025-05-07T20:31:46.7632904Z ) -> None: 2025-05-07T20:31:46.7633002Z torch.manual_seed(2025) 2025-05-07T20:31:46.7633085Z 2025-05-07T20:31:46.7633257Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7633333Z 2025-05-07T20:31:46.7633433Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7633558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7633658Z x = x_sign * x_clamp 2025-05-07T20:31:46.7633743Z x0 = x[:, :D] 2025-05-07T20:31:46.7633832Z x1 = x[:, D:] 2025-05-07T20:31:46.7633916Z 2025-05-07T20:31:46.7634004Z if contiguous: 2025-05-07T20:31:46.7634107Z x0 = x0.contiguous() 2025-05-07T20:31:46.7634207Z x1 = x1.contiguous() 2025-05-07T20:31:46.7634287Z 2025-05-07T20:31:46.7634382Z if scale_ub is not None: 2025-05-07T20:31:46.7634495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7634631Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7634712Z ) 2025-05-07T20:31:46.7634801Z else: 2025-05-07T20:31:46.7634898Z scale_ub_tensor = None 2025-05-07T20:31:46.7634984Z 2025-05-07T20:31:46.7635116Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7635210Z op = silu_mul_quant 2025-05-07T20:31:46.7635305Z if compiled: 2025-05-07T20:31:46.7635408Z op = torch.compile(op) 2025-05-07T20:31:46.7635516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7635606Z 2025-05-07T20:31:46.7635702Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7635706Z 2025-05-07T20:31:46.7635816Z moe/activation_test.py:117: 2025-05-07T20:31:46.7635953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7636058Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7636167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7636671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:46.7636772Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7637136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7637363Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7637703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7637817Z kernel = self.compile( 2025-05-07T20:31:46.7638207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7638390Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7638519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7638523Z 2025-05-07T20:31:46.7638728Z self = 2025-05-07T20:31:46.7639507Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7640009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54500550>} 2025-05-07T20:31:46.7640922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7641115Z context = 2025-05-07T20:31:46.7641119Z 2025-05-07T20:31:46.7641293Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7641559Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7641668Z module_map=module_map) 2025-05-07T20:31:46.7641839Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7641942Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7642022Z E ^ 2025-05-07T20:31:46.7642382Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7642395Z 2025-05-07T20:31:46.7642817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7642822Z 2025-05-07T20:31:46.7642935Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7643159Z self=, 2025-05-07T20:31:46.7643240Z T=16384, 2025-05-07T20:31:46.7643326Z D=7168, 2025-05-07T20:31:46.7643413Z scale_ub=None, 2025-05-07T20:31:46.7643502Z contiguous=True, 2025-05-07T20:31:46.7643595Z compiled=True, 2025-05-07T20:31:46.7643673Z ) 2025-05-07T20:31:46.7643893Z self = 2025-05-07T20:31:46.7644076Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7644080Z 2025-05-07T20:31:46.7644162Z @given( 2025-05-07T20:31:46.7644288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7644397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7644524Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7644649Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7644764Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7644842Z ) 2025-05-07T20:31:46.7645099Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7645196Z def test_silu_mul_quant( 2025-05-07T20:31:46.7645276Z self, 2025-05-07T20:31:46.7645362Z T: int, 2025-05-07T20:31:46.7645441Z D: int, 2025-05-07T20:31:46.7645553Z scale_ub: Optional[float], 2025-05-07T20:31:46.7645646Z contiguous: bool, 2025-05-07T20:31:46.7645735Z compiled: bool, 2025-05-07T20:31:46.7645821Z ) -> None: 2025-05-07T20:31:46.7645919Z torch.manual_seed(2025) 2025-05-07T20:31:46.7645994Z 2025-05-07T20:31:46.7646172Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7646253Z 2025-05-07T20:31:46.7646352Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7646489Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7646582Z x = x_sign * x_clamp 2025-05-07T20:31:46.7646665Z x0 = x[:, :D] 2025-05-07T20:31:46.7646757Z x1 = x[:, D:] 2025-05-07T20:31:46.7646832Z 2025-05-07T20:31:46.7646921Z if contiguous: 2025-05-07T20:31:46.7647024Z x0 = x0.contiguous() 2025-05-07T20:31:46.7647115Z x1 = x1.contiguous() 2025-05-07T20:31:46.7647198Z 2025-05-07T20:31:46.7647293Z if scale_ub is not None: 2025-05-07T20:31:46.7647403Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7647548Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7647626Z ) 2025-05-07T20:31:46.7647706Z else: 2025-05-07T20:31:46.7647812Z scale_ub_tensor = None 2025-05-07T20:31:46.7648013Z 2025-05-07T20:31:46.7648147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7648325Z op = silu_mul_quant 2025-05-07T20:31:46.7648416Z if compiled: 2025-05-07T20:31:46.7648522Z op = torch.compile(op) 2025-05-07T20:31:46.7648641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7648721Z 2025-05-07T20:31:46.7648827Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7648832Z 2025-05-07T20:31:46.7648934Z moe/activation_test.py:117: 2025-05-07T20:31:46.7649068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7649179Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7649282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7649653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7649757Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7650255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7650369Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7650732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7650959Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7651308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7651405Z kernel = self.compile( 2025-05-07T20:31:46.7651791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7651977Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7652105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7652116Z 2025-05-07T20:31:46.7652333Z self = 2025-05-07T20:31:46.7653109Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7653625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54501360>} 2025-05-07T20:31:46.7654369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7654561Z context = 2025-05-07T20:31:46.7654566Z 2025-05-07T20:31:46.7654744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7655016Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7655132Z module_map=module_map) 2025-05-07T20:31:46.7655297Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7655398Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7655485Z E ^ 2025-05-07T20:31:46.7655838Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7655843Z 2025-05-07T20:31:46.7656256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7656267Z 2025-05-07T20:31:46.7656375Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7656597Z self=, 2025-05-07T20:31:46.7656682Z T=4096, 2025-05-07T20:31:46.7656850Z D=5120, 2025-05-07T20:31:46.7656935Z scale_ub=None, 2025-05-07T20:31:46.7657029Z contiguous=False, 2025-05-07T20:31:46.7657188Z compiled=True, 2025-05-07T20:31:46.7657267Z ) 2025-05-07T20:31:46.7657491Z self = 2025-05-07T20:31:46.7657666Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.7657671Z 2025-05-07T20:31:46.7657752Z @given( 2025-05-07T20:31:46.7657878Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7657980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7658105Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7658223Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7658342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7658426Z ) 2025-05-07T20:31:46.7658675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7658778Z def test_silu_mul_quant( 2025-05-07T20:31:46.7658865Z self, 2025-05-07T20:31:46.7658945Z T: int, 2025-05-07T20:31:46.7659030Z D: int, 2025-05-07T20:31:46.7659137Z scale_ub: Optional[float], 2025-05-07T20:31:46.7659230Z contiguous: bool, 2025-05-07T20:31:46.7659327Z compiled: bool, 2025-05-07T20:31:46.7659409Z ) -> None: 2025-05-07T20:31:46.7659508Z torch.manual_seed(2025) 2025-05-07T20:31:46.7659590Z 2025-05-07T20:31:46.7659759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7659938Z 2025-05-07T20:31:46.7660039Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7660166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7660257Z x = x_sign * x_clamp 2025-05-07T20:31:46.7660346Z x0 = x[:, :D] 2025-05-07T20:31:46.7660429Z x1 = x[:, D:] 2025-05-07T20:31:46.7660510Z 2025-05-07T20:31:46.7660605Z if contiguous: 2025-05-07T20:31:46.7660704Z x0 = x0.contiguous() 2025-05-07T20:31:46.7660796Z x1 = x1.contiguous() 2025-05-07T20:31:46.7660885Z 2025-05-07T20:31:46.7660978Z if scale_ub is not None: 2025-05-07T20:31:46.7661092Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7661229Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7661308Z ) 2025-05-07T20:31:46.7661392Z else: 2025-05-07T20:31:46.7661491Z scale_ub_tensor = None 2025-05-07T20:31:46.7661567Z 2025-05-07T20:31:46.7661703Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7661798Z op = silu_mul_quant 2025-05-07T20:31:46.7661886Z if compiled: 2025-05-07T20:31:46.7661994Z op = torch.compile(op) 2025-05-07T20:31:46.7662102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7662178Z 2025-05-07T20:31:46.7662280Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7662288Z 2025-05-07T20:31:46.7662391Z moe/activation_test.py:117: 2025-05-07T20:31:46.7662533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7662638Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7662740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7663113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7663214Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7663706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7663816Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7664181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7664411Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7664851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7665027Z kernel = self.compile( 2025-05-07T20:31:46.7665423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7665603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7665739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7665744Z 2025-05-07T20:31:46.7665952Z self = 2025-05-07T20:31:46.7666722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7667228Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54501ea0>} 2025-05-07T20:31:46.7667978Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7668175Z context = 2025-05-07T20:31:46.7668180Z 2025-05-07T20:31:46.7668349Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7668612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7668728Z module_map=module_map) 2025-05-07T20:31:46.7668894Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7669002Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7669084Z E ^ 2025-05-07T20:31:46.7669436Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7669445Z 2025-05-07T20:31:46.7669881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7669886Z 2025-05-07T20:31:46.7669993Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7670226Z self=, 2025-05-07T20:31:46.7670307Z T=4096, 2025-05-07T20:31:46.7670388Z D=5120, 2025-05-07T20:31:46.7670481Z scale_ub=1200.0, 2025-05-07T20:31:46.7670572Z contiguous=False, 2025-05-07T20:31:46.7670663Z compiled=False, 2025-05-07T20:31:46.7670746Z ) 2025-05-07T20:31:46.7670965Z self = 2025-05-07T20:31:46.7671143Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.7671148Z 2025-05-07T20:31:46.7671242Z @given( 2025-05-07T20:31:46.7671364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7671480Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7671598Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7671717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7671838Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7671916Z ) 2025-05-07T20:31:46.7672162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7672270Z def test_silu_mul_quant( 2025-05-07T20:31:46.7672350Z self, 2025-05-07T20:31:46.7672429Z T: int, 2025-05-07T20:31:46.7672515Z D: int, 2025-05-07T20:31:46.7672616Z scale_ub: Optional[float], 2025-05-07T20:31:46.7672709Z contiguous: bool, 2025-05-07T20:31:46.7672803Z compiled: bool, 2025-05-07T20:31:46.7672884Z ) -> None: 2025-05-07T20:31:46.7672986Z torch.manual_seed(2025) 2025-05-07T20:31:46.7673160Z 2025-05-07T20:31:46.7673342Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7673594Z 2025-05-07T20:31:46.7673692Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7673826Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7673918Z x = x_sign * x_clamp 2025-05-07T20:31:46.7674001Z x0 = x[:, :D] 2025-05-07T20:31:46.7674091Z x1 = x[:, D:] 2025-05-07T20:31:46.7674167Z 2025-05-07T20:31:46.7674256Z if contiguous: 2025-05-07T20:31:46.7674362Z x0 = x0.contiguous() 2025-05-07T20:31:46.7674456Z x1 = x1.contiguous() 2025-05-07T20:31:46.7674534Z 2025-05-07T20:31:46.7674637Z if scale_ub is not None: 2025-05-07T20:31:46.7674745Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7674883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7674972Z ) 2025-05-07T20:31:46.7675053Z else: 2025-05-07T20:31:46.7675165Z scale_ub_tensor = None 2025-05-07T20:31:46.7675243Z 2025-05-07T20:31:46.7675378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7675476Z op = silu_mul_quant 2025-05-07T20:31:46.7675565Z if compiled: 2025-05-07T20:31:46.7675668Z op = torch.compile(op) 2025-05-07T20:31:46.7675784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7675860Z 2025-05-07T20:31:46.7675957Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7675961Z 2025-05-07T20:31:46.7676069Z moe/activation_test.py:117: 2025-05-07T20:31:46.7676203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7676315Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7676419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7676915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:46.7677030Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7677392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7677620Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7677967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7678063Z kernel = self.compile( 2025-05-07T20:31:46.7678451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7678628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7678759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7678763Z 2025-05-07T20:31:46.7678974Z self = 2025-05-07T20:31:46.7679753Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7680260Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54502680>} 2025-05-07T20:31:46.7681000Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7681193Z context = 2025-05-07T20:31:46.7681204Z 2025-05-07T20:31:46.7681373Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7681635Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7681868Z module_map=module_map) 2025-05-07T20:31:46.7682107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7682211Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7682298Z E ^ 2025-05-07T20:31:46.7682649Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7682654Z 2025-05-07T20:31:46.7683075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7683080Z 2025-05-07T20:31:46.7683189Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7683411Z self=, 2025-05-07T20:31:46.7683501Z T=4096, 2025-05-07T20:31:46.7683579Z D=5120, 2025-05-07T20:31:46.7683666Z scale_ub=1200.0, 2025-05-07T20:31:46.7683763Z contiguous=False, 2025-05-07T20:31:46.7683855Z compiled=True, 2025-05-07T20:31:46.7683931Z ) 2025-05-07T20:31:46.7684159Z self = 2025-05-07T20:31:46.7684337Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.7684342Z 2025-05-07T20:31:46.7684428Z @given( 2025-05-07T20:31:46.7684550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7684652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7684776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7684895Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7685012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7685097Z ) 2025-05-07T20:31:46.7685346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7685448Z def test_silu_mul_quant( 2025-05-07T20:31:46.7685531Z self, 2025-05-07T20:31:46.7685616Z T: int, 2025-05-07T20:31:46.7685702Z D: int, 2025-05-07T20:31:46.7685806Z scale_ub: Optional[float], 2025-05-07T20:31:46.7685904Z contiguous: bool, 2025-05-07T20:31:46.7686000Z compiled: bool, 2025-05-07T20:31:46.7686081Z ) -> None: 2025-05-07T20:31:46.7686178Z torch.manual_seed(2025) 2025-05-07T20:31:46.7686263Z 2025-05-07T20:31:46.7686431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7686509Z 2025-05-07T20:31:46.7686611Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7686737Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7686830Z x = x_sign * x_clamp 2025-05-07T20:31:46.7686919Z x0 = x[:, :D] 2025-05-07T20:31:46.7687003Z x1 = x[:, D:] 2025-05-07T20:31:46.7687086Z 2025-05-07T20:31:46.7687174Z if contiguous: 2025-05-07T20:31:46.7687271Z x0 = x0.contiguous() 2025-05-07T20:31:46.7687372Z x1 = x1.contiguous() 2025-05-07T20:31:46.7687453Z 2025-05-07T20:31:46.7687547Z if scale_ub is not None: 2025-05-07T20:31:46.7687668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7687806Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7687885Z ) 2025-05-07T20:31:46.7687969Z else: 2025-05-07T20:31:46.7688068Z scale_ub_tensor = None 2025-05-07T20:31:46.7688145Z 2025-05-07T20:31:46.7688282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7688375Z op = silu_mul_quant 2025-05-07T20:31:46.7688471Z if compiled: 2025-05-07T20:31:46.7688573Z op = torch.compile(op) 2025-05-07T20:31:46.7688682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7688767Z 2025-05-07T20:31:46.7688864Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7688868Z 2025-05-07T20:31:46.7688970Z moe/activation_test.py:117: 2025-05-07T20:31:46.7689110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7689319Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7689494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7690905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7691213Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f1c54503ac0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f1c55826200>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
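Every failure in this run is the same compile-time rejection: Triton refuses to lower the `fp8e4nv` (FP8 E4M3) dtype on this GPU. The `linux.g5.4xlarge.nvidia.gpu` runner carries an NVIDIA A10G (compute capability 8.6), while Triton's `fp8e4nv` generally requires compute capability 8.9 or newer (Ada/Hopper); on SM 8.6 only `fp8e4b15` and `fp8e5` are available, exactly as the ValueError says. Below is a minimal sketch of a capability gate that would skip these cases on such hardware; the helper name, the 8.9 threshold, and the decorator placement are illustrative assumptions, not FBGEMM's actual code:

```python
# A minimal sketch (not from the log) of gating fp8 tests on device capability.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv kernels need an NVIDIA GPU with compute capability
    # >= 8.9 (e.g. L4, L40S, H100). The A10G behind this runner is SM 8.6,
    # so this helper would return False there.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(
    not supports_fp8e4nv(),
    "Triton fp8e4nv is unsupported on this GPU architecture",
)
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant and friends would live here
```

Gated this way, the suite would report a skip on A10G runners instead of re-raising the identical CompilationError for every Hypothesis example.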
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7710786Z 2025-05-07T20:31:46.7711204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7711208Z 2025-05-07T20:31:46.7711318Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7711539Z self=, 2025-05-07T20:31:46.7711622Z T=1, 2025-05-07T20:31:46.7711700Z D=7168, 2025-05-07T20:31:46.7711781Z scale_ub=None, 2025-05-07T20:31:46.7711877Z contiguous=True, 2025-05-07T20:31:46.7711964Z compiled=False, 2025-05-07T20:31:46.7712039Z ) 2025-05-07T20:31:46.7712263Z self = 2025-05-07T20:31:46.7712432Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.7712443Z 2025-05-07T20:31:46.7712523Z @given( 2025-05-07T20:31:46.7712656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7712756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7712880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7712998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7713111Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7713195Z ) 2025-05-07T20:31:46.7713439Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7713556Z def test_silu_mul_quant( 2025-05-07T20:31:46.7713631Z self, 2025-05-07T20:31:46.7713705Z T: int, 2025-05-07T20:31:46.7713790Z D: int, 2025-05-07T20:31:46.7713888Z scale_ub: Optional[float], 2025-05-07T20:31:46.7719963Z contiguous: bool, 2025-05-07T20:31:46.7720083Z compiled: bool, 2025-05-07T20:31:46.7720170Z ) -> None: 2025-05-07T20:31:46.7720281Z torch.manual_seed(2025) 2025-05-07T20:31:46.7720368Z 2025-05-07T20:31:46.7720550Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7720632Z 2025-05-07T20:31:46.7720741Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7720872Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7720967Z x = x_sign * x_clamp 2025-05-07T20:31:46.7721058Z x0 = x[:, :D] 2025-05-07T20:31:46.7721141Z x1 = x[:, D:] 2025-05-07T20:31:46.7721226Z 2025-05-07T20:31:46.7721315Z if contiguous: 2025-05-07T20:31:46.7721413Z x0 = x0.contiguous() 2025-05-07T20:31:46.7721515Z x1 = x1.contiguous() 2025-05-07T20:31:46.7721591Z 2025-05-07T20:31:46.7721685Z if scale_ub is not None: 2025-05-07T20:31:46.7721914Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7722055Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7722214Z ) 2025-05-07T20:31:46.7722306Z else: 2025-05-07T20:31:46.7722405Z scale_ub_tensor = None 2025-05-07T20:31:46.7722482Z 2025-05-07T20:31:46.7722625Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7722721Z op = silu_mul_quant 2025-05-07T20:31:46.7722810Z if compiled: 2025-05-07T20:31:46.7722924Z op = torch.compile(op) 2025-05-07T20:31:46.7723034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7723120Z 2025-05-07T20:31:46.7723216Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7723221Z 2025-05-07T20:31:46.7723324Z moe/activation_test.py:117: 2025-05-07T20:31:46.7723470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7723576Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7723687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7724209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7724314Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7724685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7724914Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7725265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7725376Z kernel = self.compile( 2025-05-07T20:31:46.7725768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7725946Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7726086Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7726095Z 2025-05-07T20:31:46.7726310Z self = 2025-05-07T20:31:46.7727149Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7727652Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54c484c0>} 2025-05-07T20:31:46.7728407Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7728602Z context = 2025-05-07T20:31:46.7728612Z 2025-05-07T20:31:46.7728781Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7729063Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7729175Z module_map=module_map) 2025-05-07T20:31:46.7729349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7729450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7729531Z E ^ 2025-05-07T20:31:46.7729895Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7729900Z 2025-05-07T20:31:46.7730313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7730318Z 2025-05-07T20:31:46.7730425Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7730655Z self=, 2025-05-07T20:31:46.7730822Z T=16384, 2025-05-07T20:31:46.7730909Z D=7168, 2025-05-07T20:31:46.7731097Z scale_ub=1200.0, 2025-05-07T20:31:46.7731188Z contiguous=False, 2025-05-07T20:31:46.7731284Z compiled=True, 2025-05-07T20:31:46.7731361Z ) 2025-05-07T20:31:46.7731580Z self = 2025-05-07T20:31:46.7731778Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.7731783Z 2025-05-07T20:31:46.7731863Z @given( 2025-05-07T20:31:46.7731985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7732101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7732221Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7732345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7732459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7732538Z ) 2025-05-07T20:31:46.7732805Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7732900Z def test_silu_mul_quant( 2025-05-07T20:31:46.7732984Z self, 2025-05-07T20:31:46.7733071Z T: int, 2025-05-07T20:31:46.7733151Z D: int, 2025-05-07T20:31:46.7733253Z scale_ub: Optional[float], 2025-05-07T20:31:46.7733354Z contiguous: bool, 2025-05-07T20:31:46.7733444Z compiled: bool, 2025-05-07T20:31:46.7733525Z ) -> None: 2025-05-07T20:31:46.7733631Z torch.manual_seed(2025) 2025-05-07T20:31:46.7733706Z 2025-05-07T20:31:46.7733889Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7733969Z 2025-05-07T20:31:46.7734069Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7734202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7734295Z x = x_sign * x_clamp 2025-05-07T20:31:46.7734378Z x0 = x[:, :D] 2025-05-07T20:31:46.7734469Z x1 = x[:, D:] 2025-05-07T20:31:46.7734552Z 2025-05-07T20:31:46.7734639Z if contiguous: 2025-05-07T20:31:46.7734753Z x0 = x0.contiguous() 2025-05-07T20:31:46.7734846Z x1 = x1.contiguous() 2025-05-07T20:31:46.7734921Z 2025-05-07T20:31:46.7735024Z if scale_ub is not None: 2025-05-07T20:31:46.7735135Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7735283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7735363Z ) 2025-05-07T20:31:46.7735446Z else: 2025-05-07T20:31:46.7735552Z scale_ub_tensor = None 2025-05-07T20:31:46.7735628Z 2025-05-07T20:31:46.7735760Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7735860Z op = silu_mul_quant 2025-05-07T20:31:46.7735948Z if compiled: 2025-05-07T20:31:46.7736053Z op = torch.compile(op) 2025-05-07T20:31:46.7736172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7736252Z 2025-05-07T20:31:46.7736347Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7736352Z 2025-05-07T20:31:46.7736472Z moe/activation_test.py:117: 2025-05-07T20:31:46.7736603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7736717Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7736821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7737197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7737304Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7737795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7737897Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7738264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7738488Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7739004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7739104Z kernel = self.compile( 2025-05-07T20:31:46.7739494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7739687Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7739912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7739918Z 2025-05-07T20:31:46.7740132Z self = 2025-05-07T20:31:46.7740907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7741421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54c495a0>} 2025-05-07T20:31:46.7742169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7742359Z context = 2025-05-07T20:31:46.7742364Z 2025-05-07T20:31:46.7742545Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7742811Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7742917Z module_map=module_map) 2025-05-07T20:31:46.7743092Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7743193Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7743281Z E ^ 2025-05-07T20:31:46.7743635Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7743640Z 2025-05-07T20:31:46.7744061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7744066Z 2025-05-07T20:31:46.7744169Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7744398Z self=, 2025-05-07T20:31:46.7744472Z T=1, 2025-05-07T20:31:46.7744550Z D=7168, 2025-05-07T20:31:46.7744641Z scale_ub=None, 2025-05-07T20:31:46.7744732Z contiguous=False, 2025-05-07T20:31:46.7744815Z compiled=False, 2025-05-07T20:31:46.7744895Z ) 2025-05-07T20:31:46.7745108Z self = 2025-05-07T20:31:46.7745282Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.7745292Z 2025-05-07T20:31:46.7745369Z @given( 2025-05-07T20:31:46.7745493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7745598Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7745711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7745825Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7745947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7746020Z ) 2025-05-07T20:31:46.7746270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7746370Z def test_silu_mul_quant( 2025-05-07T20:31:46.7746446Z self, 2025-05-07T20:31:46.7746528Z T: int, 2025-05-07T20:31:46.7746606Z D: int, 2025-05-07T20:31:46.7746705Z scale_ub: Optional[float], 2025-05-07T20:31:46.7746803Z contiguous: bool, 2025-05-07T20:31:46.7746889Z compiled: bool, 2025-05-07T20:31:46.7747056Z ) -> None: 2025-05-07T20:31:46.7747157Z torch.manual_seed(2025) 2025-05-07T20:31:46.7747232Z 2025-05-07T20:31:46.7747472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7747554Z 2025-05-07T20:31:46.7747648Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7747771Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7747869Z x = x_sign * x_clamp 2025-05-07T20:31:46.7747952Z x0 = x[:, :D] 2025-05-07T20:31:46.7748038Z x1 = x[:, D:] 2025-05-07T20:31:46.7748111Z 2025-05-07T20:31:46.7748195Z if contiguous: 2025-05-07T20:31:46.7748294Z x0 = x0.contiguous() 2025-05-07T20:31:46.7748383Z x1 = x1.contiguous() 2025-05-07T20:31:46.7748459Z 2025-05-07T20:31:46.7748559Z if scale_ub is not None: 2025-05-07T20:31:46.7748665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7748800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7748887Z ) 2025-05-07T20:31:46.7748964Z else: 2025-05-07T20:31:46.7749058Z scale_ub_tensor = None 2025-05-07T20:31:46.7749145Z 2025-05-07T20:31:46.7749274Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7749364Z op = silu_mul_quant 2025-05-07T20:31:46.7749459Z if compiled: 2025-05-07T20:31:46.7749563Z op = torch.compile(op) 2025-05-07T20:31:46.7749679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7749753Z 2025-05-07T20:31:46.7749844Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7749849Z 2025-05-07T20:31:46.7749953Z moe/activation_test.py:117: 2025-05-07T20:31:46.7750082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7750185Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7750291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7750783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7750901Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7751264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7751486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7751839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7751935Z kernel = self.compile( 2025-05-07T20:31:46.7752315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7752499Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7752627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7752631Z 2025-05-07T20:31:46.7752851Z self = 2025-05-07T20:31:46.7753628Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7754134Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54c49d80>} 2025-05-07T20:31:46.7754869Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7755060Z context = 2025-05-07T20:31:46.7755065Z 2025-05-07T20:31:46.7755243Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7755588Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7755770Z module_map=module_map) 2025-05-07T20:31:46.7755935Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7756035Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7756119Z E ^ 2025-05-07T20:31:46.7756473Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7756478Z 2025-05-07T20:31:46.7756944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7756957Z 2025-05-07T20:31:46.7757059Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7757278Z self=, 2025-05-07T20:31:46.7757360Z T=2048, 2025-05-07T20:31:46.7757438Z D=7168, 2025-05-07T20:31:46.7757524Z scale_ub=None, 2025-05-07T20:31:46.7757620Z contiguous=False, 2025-05-07T20:31:46.7757700Z compiled=True, 2025-05-07T20:31:46.7757779Z ) 2025-05-07T20:31:46.7757999Z self = 2025-05-07T20:31:46.7758174Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.7758178Z 2025-05-07T20:31:46.7758255Z @given( 2025-05-07T20:31:46.7758382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7758479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7758600Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7758716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7758830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7758908Z ) 2025-05-07T20:31:46.7759156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7759249Z def test_silu_mul_quant( 2025-05-07T20:31:46.7759340Z self, 2025-05-07T20:31:46.7759415Z T: int, 2025-05-07T20:31:46.7759489Z D: int, 2025-05-07T20:31:46.7759601Z scale_ub: Optional[float], 2025-05-07T20:31:46.7759692Z contiguous: bool, 2025-05-07T20:31:46.7759777Z compiled: bool, 2025-05-07T20:31:46.7759861Z ) -> None: 2025-05-07T20:31:46.7759952Z torch.manual_seed(2025) 2025-05-07T20:31:46.7760030Z 2025-05-07T20:31:46.7760196Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7760270Z 2025-05-07T20:31:46.7760369Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7760491Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7760583Z x = x_sign * x_clamp 2025-05-07T20:31:46.7760670Z x0 = x[:, :D] 2025-05-07T20:31:46.7760746Z x1 = x[:, D:] 2025-05-07T20:31:46.7760818Z 2025-05-07T20:31:46.7760907Z if contiguous: 2025-05-07T20:31:46.7760999Z x0 = x0.contiguous() 2025-05-07T20:31:46.7761091Z x1 = x1.contiguous() 2025-05-07T20:31:46.7761172Z 2025-05-07T20:31:46.7761269Z if scale_ub is not None: 2025-05-07T20:31:46.7761380Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7761514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7761589Z ) 2025-05-07T20:31:46.7761669Z else: 2025-05-07T20:31:46.7761764Z scale_ub_tensor = None 2025-05-07T20:31:46.7761837Z 2025-05-07T20:31:46.7761969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7762058Z op = silu_mul_quant 2025-05-07T20:31:46.7762143Z if compiled: 2025-05-07T20:31:46.7762257Z op = torch.compile(op) 2025-05-07T20:31:46.7762363Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7762440Z 2025-05-07T20:31:46.7762538Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7762542Z 2025-05-07T20:31:46.7762749Z moe/activation_test.py:117: 2025-05-07T20:31:46.7762890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7763067Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7763170Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7763548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7763643Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7764134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7764240Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7764594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7764826Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7765171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7765272Z kernel = self.compile( 2025-05-07T20:31:46.7765665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7765843Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7765967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7765978Z 2025-05-07T20:31:46.7766183Z self = 2025-05-07T20:31:46.7766966Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7767480Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54c4af80>} 2025-05-07T20:31:46.7768239Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7768436Z context = 2025-05-07T20:31:46.7768441Z 2025-05-07T20:31:46.7768608Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7768867Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7768984Z module_map=module_map) 2025-05-07T20:31:46.7769146Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7769253Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7769329Z E ^ 2025-05-07T20:31:46.7769685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7769696Z 2025-05-07T20:31:46.7770128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7770133Z 2025-05-07T20:31:46.7770236Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7770465Z self=, 2025-05-07T20:31:46.7770541Z T=4096, 2025-05-07T20:31:46.7770615Z D=7168, 2025-05-07T20:31:46.7770701Z scale_ub=None, 2025-05-07T20:31:46.7770786Z contiguous=False, 2025-05-07T20:31:46.7770869Z compiled=True, 2025-05-07T20:31:46.7770951Z ) 2025-05-07T20:31:46.7771166Z self = 2025-05-07T20:31:46.7771337Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.7771342Z 2025-05-07T20:31:46.7771425Z @given( 2025-05-07T20:31:46.7771542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7771725Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7771922Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7772042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7772159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7772232Z ) 2025-05-07T20:31:46.7772480Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7772578Z def test_silu_mul_quant( 2025-05-07T20:31:46.7772657Z self, 2025-05-07T20:31:46.7772736Z T: int, 2025-05-07T20:31:46.7772816Z D: int, 2025-05-07T20:31:46.7772913Z scale_ub: Optional[float], 2025-05-07T20:31:46.7773003Z contiguous: bool, 2025-05-07T20:31:46.7773096Z compiled: bool, 2025-05-07T20:31:46.7773174Z ) -> None: 2025-05-07T20:31:46.7773267Z torch.manual_seed(2025) 2025-05-07T20:31:46.7773349Z 2025-05-07T20:31:46.7773524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7773605Z 2025-05-07T20:31:46.7773705Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7773828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7773925Z x = x_sign * x_clamp 2025-05-07T20:31:46.7774003Z x0 = x[:, :D] 2025-05-07T20:31:46.7774083Z x1 = x[:, D:] 2025-05-07T20:31:46.7774159Z 2025-05-07T20:31:46.7774241Z if contiguous: 2025-05-07T20:31:46.7774332Z x0 = x0.contiguous() 2025-05-07T20:31:46.7774426Z x1 = x1.contiguous() 2025-05-07T20:31:46.7774498Z 2025-05-07T20:31:46.7774590Z if scale_ub is not None: 2025-05-07T20:31:46.7774698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7774832Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7774916Z ) 2025-05-07T20:31:46.7774993Z else: 2025-05-07T20:31:46.7775087Z scale_ub_tensor = None 2025-05-07T20:31:46.7775172Z 2025-05-07T20:31:46.7775301Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7775396Z op = silu_mul_quant 2025-05-07T20:31:46.7775488Z if compiled: 2025-05-07T20:31:46.7775588Z op = torch.compile(op) 2025-05-07T20:31:46.7775694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7775770Z 2025-05-07T20:31:46.7775861Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7775865Z 2025-05-07T20:31:46.7775963Z moe/activation_test.py:117: 2025-05-07T20:31:46.7776098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7776198Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7776304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7776669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7776761Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7777272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7777371Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7777726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7777958Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7778298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7778400Z kernel = self.compile( 2025-05-07T20:31:46.7778778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7778952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7779084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7779178Z 2025-05-07T20:31:46.7779383Z self = 2025-05-07T20:31:46.7780288Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7780797Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54c4be20>} 2025-05-07T20:31:46.7781542Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7781730Z context = 2025-05-07T20:31:46.7781735Z 2025-05-07T20:31:46.7781898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7782181Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7782288Z module_map=module_map) 2025-05-07T20:31:46.7782451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7782558Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7782633Z E ^ 2025-05-07T20:31:46.7782998Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7783003Z 2025-05-07T20:31:46.7783413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7783418Z 2025-05-07T20:31:46.7783524Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7783751Z self=, 2025-05-07T20:31:46.7783836Z T=16384, 2025-05-07T20:31:46.7783914Z D=5120, 2025-05-07T20:31:46.7784005Z scale_ub=1200.0, 2025-05-07T20:31:46.7784091Z contiguous=False, 2025-05-07T20:31:46.7784186Z compiled=False, 2025-05-07T20:31:46.7784259Z ) 2025-05-07T20:31:46.7784471Z self = 2025-05-07T20:31:46.7784657Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.7784662Z 2025-05-07T20:31:46.7784741Z @given( 2025-05-07T20:31:46.7784857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7784960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7785075Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7785191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7785311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7785385Z ) 2025-05-07T20:31:46.7785638Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7785736Z def test_silu_mul_quant( 2025-05-07T20:31:46.7785812Z self, 2025-05-07T20:31:46.7785897Z T: int, 2025-05-07T20:31:46.7785972Z D: int, 2025-05-07T20:31:46.7786073Z scale_ub: Optional[float], 2025-05-07T20:31:46.7786169Z contiguous: bool, 2025-05-07T20:31:46.7786253Z compiled: bool, 2025-05-07T20:31:46.7786331Z ) -> None: 2025-05-07T20:31:46.7786427Z torch.manual_seed(2025) 2025-05-07T20:31:46.7786499Z 2025-05-07T20:31:46.7786665Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7786743Z 2025-05-07T20:31:46.7786835Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7786967Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7787058Z x = x_sign * x_clamp 2025-05-07T20:31:46.7787139Z x0 = x[:, :D] 2025-05-07T20:31:46.7787224Z x1 = x[:, D:] 2025-05-07T20:31:46.7787297Z 2025-05-07T20:31:46.7787464Z if contiguous: 2025-05-07T20:31:46.7787563Z x0 = x0.contiguous() 2025-05-07T20:31:46.7787724Z x1 = x1.contiguous() 2025-05-07T20:31:46.7787800Z 2025-05-07T20:31:46.7787897Z if scale_ub is not None: 2025-05-07T20:31:46.7788000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7788134Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7788215Z ) 2025-05-07T20:31:46.7788293Z else: 2025-05-07T20:31:46.7788395Z scale_ub_tensor = None 2025-05-07T20:31:46.7788465Z 2025-05-07T20:31:46.7788596Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7788689Z op = silu_mul_quant 2025-05-07T20:31:46.7788774Z if compiled: 2025-05-07T20:31:46.7788874Z op = torch.compile(op) 2025-05-07T20:31:46.7788984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7789054Z 2025-05-07T20:31:46.7789145Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7789153Z 2025-05-07T20:31:46.7789259Z moe/activation_test.py:117: 2025-05-07T20:31:46.7789393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7789502Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7789601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7791374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:46.7791506Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7791948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7792207Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7792628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7792727Z kernel = self.compile( 2025-05-07T20:31:46.7793133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7793311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7793438Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7793444Z 2025-05-07T20:31:46.7793655Z self = 2025-05-07T20:31:46.7794423Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7794924Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54b517e0>} 2025-05-07T20:31:46.7795676Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7795869Z context = 2025-05-07T20:31:46.7795881Z 2025-05-07T20:31:46.7796047Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7796311Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7796428Z module_map=module_map) 2025-05-07T20:31:46.7796589Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7796689Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7796772Z E ^ 2025-05-07T20:31:46.7797121Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7797126Z 2025-05-07T20:31:46.7797544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7797779Z 2025-05-07T20:31:46.7797991Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7798214Z self=, 2025-05-07T20:31:46.7798303Z T=16384, 2025-05-07T20:31:46.7798377Z D=5120, 2025-05-07T20:31:46.7798462Z scale_ub=1200.0, 2025-05-07T20:31:46.7798553Z contiguous=True, 2025-05-07T20:31:46.7798636Z compiled=True, 2025-05-07T20:31:46.7798711Z ) 2025-05-07T20:31:46.7798932Z self = 2025-05-07T20:31:46.7799108Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.7799113Z 2025-05-07T20:31:46.7799200Z @given( 2025-05-07T20:31:46.7799319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7799420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7799547Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7799664Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7799783Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7799866Z ) 2025-05-07T20:31:46.7800120Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7800214Z def test_silu_mul_quant( 2025-05-07T20:31:46.7800299Z self, 2025-05-07T20:31:46.7800375Z T: int, 2025-05-07T20:31:46.7800457Z D: int, 2025-05-07T20:31:46.7800556Z scale_ub: Optional[float], 2025-05-07T20:31:46.7800645Z contiguous: bool, 2025-05-07T20:31:46.7800735Z compiled: bool, 2025-05-07T20:31:46.7800816Z ) -> None: 2025-05-07T20:31:46.7800909Z torch.manual_seed(2025) 2025-05-07T20:31:46.7800988Z 2025-05-07T20:31:46.7801157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7801232Z 2025-05-07T20:31:46.7801335Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7801459Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7801551Z x = x_sign * x_clamp 2025-05-07T20:31:46.7801637Z x0 = x[:, :D] 2025-05-07T20:31:46.7801717Z x1 = x[:, D:] 2025-05-07T20:31:46.7801798Z 2025-05-07T20:31:46.7801882Z if contiguous: 2025-05-07T20:31:46.7801974Z x0 = x0.contiguous() 2025-05-07T20:31:46.7802070Z x1 = x1.contiguous() 2025-05-07T20:31:46.7802145Z 2025-05-07T20:31:46.7802234Z if scale_ub is not None: 2025-05-07T20:31:46.7802344Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7802477Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7802553Z ) 2025-05-07T20:31:46.7802638Z else: 2025-05-07T20:31:46.7802732Z scale_ub_tensor = None 2025-05-07T20:31:46.7802808Z 2025-05-07T20:31:46.7802937Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7803033Z op = silu_mul_quant 2025-05-07T20:31:46.7803124Z if compiled: 2025-05-07T20:31:46.7803227Z op = torch.compile(op) 2025-05-07T20:31:46.7803332Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7803410Z 2025-05-07T20:31:46.7803503Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7803507Z 2025-05-07T20:31:46.7803610Z moe/activation_test.py:117: 2025-05-07T20:31:46.7803738Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7803839Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7803943Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7804306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7804401Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7804907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7805096Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7805537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7805763Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7806106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7806208Z kernel = self.compile( 2025-05-07T20:31:46.7806586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7806765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7806893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7806898Z 2025-05-07T20:31:46.7807101Z self = 2025-05-07T20:31:46.7807903Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7808404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54b51090>} 2025-05-07T20:31:46.7809147Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7809336Z context = 2025-05-07T20:31:46.7809341Z 2025-05-07T20:31:46.7809506Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7809773Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7809886Z module_map=module_map) 2025-05-07T20:31:46.7810059Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7810156Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7810231Z E ^ 2025-05-07T20:31:46.7810589Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7810594Z 2025-05-07T20:31:46.7811005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7811010Z 2025-05-07T20:31:46.7811124Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7811344Z self=, 2025-05-07T20:31:46.7811427Z T=16384, 2025-05-07T20:31:46.7811513Z D=5120, 2025-05-07T20:31:46.7811594Z scale_ub=None, 2025-05-07T20:31:46.7811686Z contiguous=False, 2025-05-07T20:31:46.7811776Z compiled=True, 2025-05-07T20:31:46.7811849Z ) 2025-05-07T20:31:46.7812069Z self = 2025-05-07T20:31:46.7812251Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.7812255Z 2025-05-07T20:31:46.7812329Z @given( 2025-05-07T20:31:46.7812451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7812551Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7812662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7812782Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7812896Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7812973Z ) 2025-05-07T20:31:46.7813226Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7813322Z def test_silu_mul_quant( 2025-05-07T20:31:46.7813398Z self, 2025-05-07T20:31:46.7813567Z T: int, 2025-05-07T20:31:46.7813642Z D: int, 2025-05-07T20:31:46.7813740Z scale_ub: Optional[float], 2025-05-07T20:31:46.7813907Z contiguous: bool, 2025-05-07T20:31:46.7813997Z compiled: bool, 2025-05-07T20:31:46.7814079Z ) -> None: 2025-05-07T20:31:46.7814172Z torch.manual_seed(2025) 2025-05-07T20:31:46.7814247Z 2025-05-07T20:31:46.7814424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7814500Z 2025-05-07T20:31:46.7814592Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7814721Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7814812Z x = x_sign * x_clamp 2025-05-07T20:31:46.7814892Z x0 = x[:, :D] 2025-05-07T20:31:46.7814977Z x1 = x[:, D:] 2025-05-07T20:31:46.7815052Z 2025-05-07T20:31:46.7815134Z if contiguous: 2025-05-07T20:31:46.7815232Z x0 = x0.contiguous() 2025-05-07T20:31:46.7815328Z x1 = x1.contiguous() 2025-05-07T20:31:46.7815401Z 2025-05-07T20:31:46.7815498Z if scale_ub is not None: 2025-05-07T20:31:46.7815610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7815748Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7815821Z ) 2025-05-07T20:31:46.7815898Z else: 2025-05-07T20:31:46.7815998Z scale_ub_tensor = None 2025-05-07T20:31:46.7816072Z 2025-05-07T20:31:46.7816203Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7816301Z op = silu_mul_quant 2025-05-07T20:31:46.7816390Z if compiled: 2025-05-07T20:31:46.7816501Z op = torch.compile(op) 2025-05-07T20:31:46.7816630Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7816718Z 2025-05-07T20:31:46.7816821Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7816830Z 2025-05-07T20:31:46.7816928Z moe/activation_test.py:117: 2025-05-07T20:31:46.7817061Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7817171Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7817276Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7817643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7817747Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7818234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7818337Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7818694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7818923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7819270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7819373Z kernel = self.compile( 2025-05-07T20:31:46.7819765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7820049Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7820178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7820183Z 2025-05-07T20:31:46.7820400Z self = 2025-05-07T20:31:46.7821167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7821672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54b52290>} 2025-05-07T20:31:46.7822620Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7822812Z context = 2025-05-07T20:31:46.7822817Z 2025-05-07T20:31:46.7822990Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7823253Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7823359Z module_map=module_map) 2025-05-07T20:31:46.7823527Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7823627Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7823711Z E ^ 2025-05-07T20:31:46.7824058Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7824069Z 2025-05-07T20:31:46.7824484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7824488Z 2025-05-07T20:31:46.7824600Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7824819Z self=, 2025-05-07T20:31:46.7824900Z T=2048, 2025-05-07T20:31:46.7824974Z D=5120, 2025-05-07T20:31:46.7825053Z scale_ub=None, 2025-05-07T20:31:46.7825142Z contiguous=False, 2025-05-07T20:31:46.7825223Z compiled=True, 2025-05-07T20:31:46.7825298Z ) 2025-05-07T20:31:46.7825518Z self = 2025-05-07T20:31:46.7825688Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.7825693Z 2025-05-07T20:31:46.7825767Z @given( 2025-05-07T20:31:46.7825891Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7825993Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7826114Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7826234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7826346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7826425Z ) 2025-05-07T20:31:46.7826672Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7826766Z def test_silu_mul_quant( 2025-05-07T20:31:46.7826849Z self, 2025-05-07T20:31:46.7826922Z T: int, 2025-05-07T20:31:46.7826995Z D: int, 2025-05-07T20:31:46.7827099Z scale_ub: Optional[float], 2025-05-07T20:31:46.7827185Z contiguous: bool, 2025-05-07T20:31:46.7827269Z compiled: bool, 2025-05-07T20:31:46.7827349Z ) -> None: 2025-05-07T20:31:46.7827442Z torch.manual_seed(2025) 2025-05-07T20:31:46.7827521Z 2025-05-07T20:31:46.7827686Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7827763Z 2025-05-07T20:31:46.7827861Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7827990Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7828079Z x = x_sign * x_clamp 2025-05-07T20:31:46.7828161Z x0 = x[:, :D] 2025-05-07T20:31:46.7828241Z x1 = x[:, D:] 2025-05-07T20:31:46.7828312Z 2025-05-07T20:31:46.7828399Z if contiguous: 2025-05-07T20:31:46.7828490Z x0 = x0.contiguous() 2025-05-07T20:31:46.7828577Z x1 = x1.contiguous() 2025-05-07T20:31:46.7828653Z 2025-05-07T20:31:46.7828742Z if scale_ub is not None: 2025-05-07T20:31:46.7828847Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7828989Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7829067Z ) 2025-05-07T20:31:46.7829148Z else: 2025-05-07T20:31:46.7829241Z scale_ub_tensor = None 2025-05-07T20:31:46.7829314Z 2025-05-07T20:31:46.7829537Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7829625Z op = silu_mul_quant 2025-05-07T20:31:46.7829783Z if compiled: 2025-05-07T20:31:46.7829891Z op = torch.compile(op) 2025-05-07T20:31:46.7829996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7830064Z 2025-05-07T20:31:46.7830160Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7830164Z 2025-05-07T20:31:46.7830259Z moe/activation_test.py:117: 2025-05-07T20:31:46.7830390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7830490Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7830589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7830961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7831053Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7831539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7831653Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7832010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7832237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7832572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7832666Z kernel = self.compile( 2025-05-07T20:31:46.7833047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7833224Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7833352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7833363Z 2025-05-07T20:31:46.7833568Z self = 2025-05-07T20:31:46.7834338Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7834837Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54b52170>} 2025-05-07T20:31:46.7835573Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7835765Z context = 2025-05-07T20:31:46.7835770Z 2025-05-07T20:31:46.7835932Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7836195Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7836311Z module_map=module_map) 2025-05-07T20:31:46.7836471Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7836574Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7836649Z E ^ 2025-05-07T20:31:46.7837004Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7837009Z 2025-05-07T20:31:46.7837422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7837426Z 2025-05-07T20:31:46.7837529Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7837747Z self=, 2025-05-07T20:31:46.7837828Z T=2048, 2025-05-07T20:31:46.7837903Z D=5120, 2025-05-07T20:31:46.7838074Z scale_ub=1200.0, 2025-05-07T20:31:46.7838160Z contiguous=False, 2025-05-07T20:31:46.7838243Z compiled=True, 2025-05-07T20:31:46.7838395Z ) 2025-05-07T20:31:46.7838614Z self = 2025-05-07T20:31:46.7838791Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.7838796Z 2025-05-07T20:31:46.7838874Z @given( 2025-05-07T20:31:46.7838991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7839089Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7839206Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7839326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7839445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7839520Z ) 2025-05-07T20:31:46.7839764Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7839867Z def test_silu_mul_quant( 2025-05-07T20:31:46.7839944Z self, 2025-05-07T20:31:46.7840016Z T: int, 2025-05-07T20:31:46.7840097Z D: int, 2025-05-07T20:31:46.7840198Z scale_ub: Optional[float], 2025-05-07T20:31:46.7840287Z contiguous: bool, 2025-05-07T20:31:46.7840377Z compiled: bool, 2025-05-07T20:31:46.7840455Z ) -> None: 2025-05-07T20:31:46.7840547Z torch.manual_seed(2025) 2025-05-07T20:31:46.7840624Z 2025-05-07T20:31:46.7840790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7840868Z 2025-05-07T20:31:46.7840959Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7841078Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7841178Z x = x_sign * x_clamp 2025-05-07T20:31:46.7841259Z x0 = x[:, :D] 2025-05-07T20:31:46.7841343Z x1 = x[:, D:] 2025-05-07T20:31:46.7841416Z 2025-05-07T20:31:46.7845879Z if contiguous: 2025-05-07T20:31:46.7846003Z x0 = x0.contiguous() 2025-05-07T20:31:46.7846094Z x1 = x1.contiguous() 2025-05-07T20:31:46.7846169Z 2025-05-07T20:31:46.7846276Z if scale_ub is not None: 2025-05-07T20:31:46.7846388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7846529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7846606Z ) 2025-05-07T20:31:46.7846683Z else: 2025-05-07T20:31:46.7846779Z scale_ub_tensor = None 2025-05-07T20:31:46.7846851Z 2025-05-07T20:31:46.7846983Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7847077Z op = silu_mul_quant 2025-05-07T20:31:46.7847167Z if compiled: 2025-05-07T20:31:46.7847266Z op = torch.compile(op) 2025-05-07T20:31:46.7847383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7847452Z 2025-05-07T20:31:46.7847542Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7847548Z 2025-05-07T20:31:46.7847657Z moe/activation_test.py:117: 2025-05-07T20:31:46.7847789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7847895Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7848000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7848374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7848469Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7848964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7849063Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7849421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7849638Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7849974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7850185Z kernel = self.compile( 2025-05-07T20:31:46.7850636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7850818Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7850943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7850948Z 2025-05-07T20:31:46.7851152Z self = 2025-05-07T20:31:46.7851927Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7852424Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54b53880>} 2025-05-07T20:31:46.7853181Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7853371Z context = 2025-05-07T20:31:46.7853376Z 2025-05-07T20:31:46.7853539Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7853804Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7853913Z module_map=module_map) 2025-05-07T20:31:46.7854080Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7854176Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7854253Z E ^ 2025-05-07T20:31:46.7854609Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then retried further examples; each of the following reached the kernel launch and failed at _fbgemm_silu_mul_quant compilation with the same error (examples with compiled=True additionally route through torch/_dynamo/eval_frame.py before reaching silu_mul_quant):

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)

Every one of these attempts ended with:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
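To iterate on a single failing configuration without replaying the whole Hypothesis search, the @example decorator can pin a concrete parameter set so it is always run in addition to the sampled ones. A standalone sketch; check_params and its body are placeholders, not the real test:

    # Hypothetical repro snippet, not from the FBGEMM repo.
    from typing import Optional

    from hypothesis import example, given, settings
    import hypothesis.strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @example(T=2048, D=5120, scale_ub=1200.0)  # first failing example above
    @settings(deadline=None, max_examples=5)
    def check_params(T: int, D: int, scale_ub: Optional[float]) -> None:
        # The real test would build x0/x1 here and call silu_mul_quant.
        assert T >= 1 and D in (5120, 7168)

    check_params()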
Several subsequent retries then failed during input-tensor setup with CUDA out-of-memory errors, before the kernel was reached:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has 32.44 MiB free; this process has 22.03 GiB in use, of which 21.61 GiB is allocated by PyTorch and 136.52 MiB is reserved but unallocated. (Same allocator guidance as above.)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has 144.44 MiB free; this process has 21.92 GiB in use, of which 21.50 GiB is allocated by PyTorch and 136.52 MiB is reserved but unallocated. (Same allocator guidance as above.)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has 32.44 MiB free; this process has 22.03 GiB in use, of which 21.67 GiB is allocated by PyTorch and 80.52 MiB is reserved but unallocated. (Same allocator guidance as above.)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has 32.44 MiB free; this process has 22.03 GiB in use, of which 21.67 GiB is allocated by PyTorch and 80.52 MiB is reserved but unallocated. (Same allocator guidance as above.)
moe/activation_test.py:94: OutOfMemoryError
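These retries fail while allocating the [T, 2 * D] bfloat16 input and its elementwise temporaries, because earlier examples have left the 22.07 GiB device nearly full. The allocator's own suggestion from the messages above, plus explicitly returning cached blocks between examples, is sketched below; both are assumptions about how one might harden the test, not code from the suite:

    import os

    # Option 1 (from the error message): reduce fragmentation via expandable
    # segments. Must be set before CUDA is first initialized in the process.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import gc

    import torch

    def release_cached_memory() -> None:
        # Option 2: drop dead references, then hand cached blocks back to the
        # driver so the next example's large inputs can be allocated.
        gc.collect()
        torch.cuda.empty_cache()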
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.7990188Z 2025-05-07T20:31:46.7990332Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:46.7990337Z 2025-05-07T20:31:46.7990442Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7990660Z self=, 2025-05-07T20:31:46.7990736Z T=1, 2025-05-07T20:31:46.7990815Z D=7168, 2025-05-07T20:31:46.7990897Z scale_ub=1200.0, 2025-05-07T20:31:46.7990980Z contiguous=True, 2025-05-07T20:31:46.7991065Z compiled=False, 2025-05-07T20:31:46.7991137Z ) 2025-05-07T20:31:46.7991350Z self = 2025-05-07T20:31:46.7991668Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.7991673Z 2025-05-07T20:31:46.7991852Z @given( 2025-05-07T20:31:46.7991977Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7992075Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7992186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7992307Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7992421Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7992506Z ) 2025-05-07T20:31:46.7992747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7992840Z def test_silu_mul_quant( 2025-05-07T20:31:46.7992918Z self, 2025-05-07T20:31:46.7992992Z T: int, 2025-05-07T20:31:46.7993072Z D: int, 2025-05-07T20:31:46.7993168Z scale_ub: Optional[float], 2025-05-07T20:31:46.7993260Z contiguous: bool, 2025-05-07T20:31:46.7993352Z compiled: bool, 2025-05-07T20:31:46.7993430Z ) -> None: 2025-05-07T20:31:46.7993532Z torch.manual_seed(2025) 2025-05-07T20:31:46.7993606Z 2025-05-07T20:31:46.7993768Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7993844Z 2025-05-07T20:31:46.7993936Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7994058Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7994149Z x = x_sign * x_clamp 2025-05-07T20:31:46.7994229Z x0 = x[:, :D] 2025-05-07T20:31:46.7994307Z x1 = x[:, D:] 2025-05-07T20:31:46.7994380Z 2025-05-07T20:31:46.7994465Z if contiguous: 2025-05-07T20:31:46.7994557Z x0 = x0.contiguous() 2025-05-07T20:31:46.7994645Z x1 = x1.contiguous() 2025-05-07T20:31:46.7994715Z 2025-05-07T20:31:46.7994810Z if scale_ub is not None: 2025-05-07T20:31:46.7994913Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7995052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7995132Z ) 2025-05-07T20:31:46.7995211Z else: 2025-05-07T20:31:46.7995304Z scale_ub_tensor = None 2025-05-07T20:31:46.7995380Z 2025-05-07T20:31:46.7995507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7995595Z op = silu_mul_quant 2025-05-07T20:31:46.7995685Z if compiled: 2025-05-07T20:31:46.7995785Z op = torch.compile(op) 2025-05-07T20:31:46.7995891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7995966Z 2025-05-07T20:31:46.7996055Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7996060Z 2025-05-07T20:31:46.7996161Z moe/activation_test.py:117: 2025-05-07T20:31:46.7996291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7996391Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7996492Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7997001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7997101Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7997467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7997688Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7998034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7998129Z kernel = self.compile( 2025-05-07T20:31:46.7998507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7998684Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7998811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7998900Z 2025-05-07T20:31:46.7999110Z self = 2025-05-07T20:31:46.7999979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8000484Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fc13b50>} 2025-05-07T20:31:46.8001230Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8001422Z context = 2025-05-07T20:31:46.8001426Z 2025-05-07T20:31:46.8001599Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8001863Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8001969Z module_map=module_map) 2025-05-07T20:31:46.8002136Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8002233Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8002311Z E ^ 2025-05-07T20:31:46.8002664Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8002669Z 2025-05-07T20:31:46.8003077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8003082Z 2025-05-07T20:31:46.8003187Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8003405Z self=, 2025-05-07T20:31:46.8003491Z T=128, 2025-05-07T20:31:46.8003566Z D=5120, 2025-05-07T20:31:46.8003647Z scale_ub=None, 2025-05-07T20:31:46.8003735Z contiguous=True, 2025-05-07T20:31:46.8003820Z compiled=False, 2025-05-07T20:31:46.8003893Z ) 2025-05-07T20:31:46.8004111Z self = 2025-05-07T20:31:46.8004279Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8004284Z 2025-05-07T20:31:46.8004359Z @given( 2025-05-07T20:31:46.8004479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8004577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8004692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8004810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8004922Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8004998Z ) 2025-05-07T20:31:46.8005244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8005342Z def test_silu_mul_quant( 2025-05-07T20:31:46.8005420Z self, 2025-05-07T20:31:46.8005501Z T: int, 2025-05-07T20:31:46.8005577Z D: int, 2025-05-07T20:31:46.8005680Z scale_ub: Optional[float], 2025-05-07T20:31:46.8005768Z contiguous: bool, 2025-05-07T20:31:46.8005856Z compiled: bool, 2025-05-07T20:31:46.8005939Z ) -> None: 2025-05-07T20:31:46.8006033Z torch.manual_seed(2025) 2025-05-07T20:31:46.8006108Z 2025-05-07T20:31:46.8006281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8006356Z 2025-05-07T20:31:46.8006450Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8006572Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8006661Z x = x_sign * x_clamp 2025-05-07T20:31:46.8006745Z x0 = x[:, :D] 2025-05-07T20:31:46.8006824Z x1 = x[:, D:] 2025-05-07T20:31:46.8006895Z 2025-05-07T20:31:46.8007069Z if contiguous: 2025-05-07T20:31:46.8007160Z x0 = x0.contiguous() 2025-05-07T20:31:46.8007324Z x1 = x1.contiguous() 2025-05-07T20:31:46.8007401Z 2025-05-07T20:31:46.8007493Z if scale_ub is not None: 2025-05-07T20:31:46.8007596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.8007734Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.8007811Z ) 2025-05-07T20:31:46.8007890Z else: 2025-05-07T20:31:46.8007985Z scale_ub_tensor = None 2025-05-07T20:31:46.8008056Z 2025-05-07T20:31:46.8008190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.8008279Z op = silu_mul_quant 2025-05-07T20:31:46.8008364Z if compiled: 2025-05-07T20:31:46.8008466Z op = torch.compile(op) 2025-05-07T20:31:46.8008569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8008641Z 2025-05-07T20:31:46.8008733Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.8008744Z 2025-05-07T20:31:46.8008842Z moe/activation_test.py:117: 2025-05-07T20:31:46.8008973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8009078Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.8009174Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8009670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.8009767Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.8010121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.8010345Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.8010685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.8010781Z kernel = self.compile( 2025-05-07T20:31:46.8011169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.8011348Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.8011480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8011484Z 2025-05-07T20:31:46.8011687Z self = 2025-05-07T20:31:46.8012459Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8012956Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa54670>} 2025-05-07T20:31:46.8013697Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8013896Z context = 2025-05-07T20:31:46.8013900Z 2025-05-07T20:31:46.8014066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8014327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8014433Z module_map=module_map) 2025-05-07T20:31:46.8014593Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8014695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8014774Z E ^ 2025-05-07T20:31:46.8015129Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8015137Z 2025-05-07T20:31:46.8015632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8015636Z 2025-05-07T20:31:46.8015810Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8016032Z self=, 2025-05-07T20:31:46.8016109Z T=128, 2025-05-07T20:31:46.8016185Z D=7168, 2025-05-07T20:31:46.8016271Z scale_ub=None, 2025-05-07T20:31:46.8016355Z contiguous=True, 2025-05-07T20:31:46.8016439Z compiled=False, 2025-05-07T20:31:46.8016515Z ) 2025-05-07T20:31:46.8016726Z self = 2025-05-07T20:31:46.8016898Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8016902Z 2025-05-07T20:31:46.8016981Z @given( 2025-05-07T20:31:46.8017100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8017203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8017325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8017439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8017560Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8017634Z ) 2025-05-07T20:31:46.8017881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8017977Z def test_silu_mul_quant( 2025-05-07T20:31:46.8018053Z self, 2025-05-07T20:31:46.8018132Z T: int, 2025-05-07T20:31:46.8018206Z D: int, 2025-05-07T20:31:46.8018304Z scale_ub: Optional[float], 2025-05-07T20:31:46.8018395Z contiguous: bool, 2025-05-07T20:31:46.8018481Z compiled: bool, 2025-05-07T20:31:46.8018560Z ) -> None: 2025-05-07T20:31:46.8018657Z torch.manual_seed(2025) 2025-05-07T20:31:46.8018727Z 2025-05-07T20:31:46.8018893Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8018969Z 2025-05-07T20:31:46.8019065Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8019188Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8019287Z x = x_sign * x_clamp 2025-05-07T20:31:46.8019368Z x0 = x[:, :D] 2025-05-07T20:31:46.8019449Z x1 = x[:, D:] 2025-05-07T20:31:46.8019523Z 2025-05-07T20:31:46.8019605Z if contiguous: 2025-05-07T20:31:46.8019700Z x0 = x0.contiguous() 2025-05-07T20:31:46.8019788Z x1 = x1.contiguous() 2025-05-07T20:31:46.8019924Z 2025-05-07T20:31:46.8020018Z if scale_ub is not None: 2025-05-07T20:31:46.8020119Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.8020252Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.8020329Z ) 2025-05-07T20:31:46.8020407Z else: 2025-05-07T20:31:46.8020501Z scale_ub_tensor = None 2025-05-07T20:31:46.8020574Z 2025-05-07T20:31:46.8020700Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.8020794Z op = silu_mul_quant 2025-05-07T20:31:46.8020880Z if compiled: 2025-05-07T20:31:46.8020985Z op = torch.compile(op) 2025-05-07T20:31:46.8021093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8021163Z 2025-05-07T20:31:46.8021252Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.8021257Z 2025-05-07T20:31:46.8021354Z moe/activation_test.py:117: 2025-05-07T20:31:46.8021480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8021579Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.8021679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8022172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.8022272Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.8022631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.8022939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.8023355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.8023450Z kernel = self.compile( 2025-05-07T20:31:46.8023833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.8024007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.8024132Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8024137Z 2025-05-07T20:31:46.8024345Z self = 2025-05-07T20:31:46.8025110Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8025616Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa54ee0>} 2025-05-07T20:31:46.8026360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8026549Z context = 2025-05-07T20:31:46.8026554Z 2025-05-07T20:31:46.8026721Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8026979Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8027087Z module_map=module_map) 2025-05-07T20:31:46.8027249Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8027355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8027434Z E ^ 2025-05-07T20:31:46.8027795Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8027801Z 2025-05-07T20:31:46.8028217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8028222Z 2025-05-07T20:31:46.8028327Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8028545Z self=, 2025-05-07T20:31:46.8028626Z T=2048, 2025-05-07T20:31:46.8028704Z D=7168, 2025-05-07T20:31:46.8028788Z scale_ub=1200.0, 2025-05-07T20:31:46.8028876Z contiguous=True, 2025-05-07T20:31:46.8028958Z compiled=False, 2025-05-07T20:31:46.8029031Z ) 2025-05-07T20:31:46.8029248Z self = 2025-05-07T20:31:46.8029425Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.8029430Z 2025-05-07T20:31:46.8029508Z @given( 2025-05-07T20:31:46.8029629Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8029724Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8029841Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8029958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8030071Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8030147Z ) 2025-05-07T20:31:46.8030392Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8030486Z def test_silu_mul_quant( 2025-05-07T20:31:46.8030570Z self, 2025-05-07T20:31:46.8030645Z T: int, 2025-05-07T20:31:46.8030718Z D: int, 2025-05-07T20:31:46.8030819Z scale_ub: Optional[float], 2025-05-07T20:31:46.8030908Z contiguous: bool, 2025-05-07T20:31:46.8031102Z compiled: bool, 2025-05-07T20:31:46.8031183Z ) -> None: 2025-05-07T20:31:46.8031352Z torch.manual_seed(2025) 2025-05-07T20:31:46.8031430Z 2025-05-07T20:31:46.8031599Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8033364Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
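The failing request sizes match the test's first allocation exactly: a [T, 2*D] bfloat16 tensor occupies T * 2D * 2 bytes, so the examples that die on line 92 fail on the very first allocation, before any kernel runs. A quick check against the example above:

    # Sanity check of the quoted request size for T=2048, D=7168.
    T, D = 2048, 7168
    bytes_needed = T * (2 * D) * 2   # bfloat16 is 2 bytes per element
    print(bytes_needed / 2**20)      # 56.0 -> matches "Tried to allocate 56.00 MiB"

The same arithmetic reproduces the other sizes in this log (40 MiB for T=2048/D=5120, 320 MiB for T=16384/D=5120, 448 MiB for T=16384/D=7168).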
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8033374Z 2025-05-07T20:31:46.8033491Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8033504Z 2025-05-07T20:31:46.8033607Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8033835Z self=, 2025-05-07T20:31:46.8033911Z T=1, 2025-05-07T20:31:46.8033985Z D=5120, 2025-05-07T20:31:46.8034072Z scale_ub=1200.0, 2025-05-07T20:31:46.8034155Z contiguous=True, 2025-05-07T20:31:46.8034238Z compiled=False, 2025-05-07T20:31:46.8034314Z ) 2025-05-07T20:31:46.8034526Z self = 2025-05-07T20:31:46.8034693Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.8034697Z 2025-05-07T20:31:46.8034774Z @given( 2025-05-07T20:31:46.8034889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8034991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8035106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8035220Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8035336Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8035409Z ) 2025-05-07T20:31:46.8035659Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8035756Z def test_silu_mul_quant( 2025-05-07T20:31:46.8035831Z self, 2025-05-07T20:31:46.8035908Z T: int, 2025-05-07T20:31:46.8035983Z D: int, 2025-05-07T20:31:46.8036081Z scale_ub: Optional[float], 2025-05-07T20:31:46.8036172Z contiguous: bool, 2025-05-07T20:31:46.8036256Z compiled: bool, 2025-05-07T20:31:46.8036331Z ) -> None: 2025-05-07T20:31:46.8036426Z torch.manual_seed(2025) 2025-05-07T20:31:46.8036498Z 2025-05-07T20:31:46.8036663Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8036739Z 2025-05-07T20:31:46.8036829Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8036953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8037050Z x = x_sign * x_clamp 2025-05-07T20:31:46.8037129Z x0 = x[:, :D] 2025-05-07T20:31:46.8037217Z x1 = x[:, D:] 2025-05-07T20:31:46.8037287Z 2025-05-07T20:31:46.8037369Z if contiguous: 2025-05-07T20:31:46.8037465Z x0 = x0.contiguous() 2025-05-07T20:31:46.8037551Z x1 = x1.contiguous() 2025-05-07T20:31:46.8037624Z 2025-05-07T20:31:46.8037716Z if scale_ub is not None: 2025-05-07T20:31:46.8037819Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.8037951Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.8038029Z ) 2025-05-07T20:31:46.8038103Z else: 2025-05-07T20:31:46.8038196Z scale_ub_tensor = None 2025-05-07T20:31:46.8038271Z 2025-05-07T20:31:46.8038396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.8038486Z op = silu_mul_quant 2025-05-07T20:31:46.8038570Z if compiled: 2025-05-07T20:31:46.8038752Z op = torch.compile(op) 2025-05-07T20:31:46.8038862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8039007Z 2025-05-07T20:31:46.8039099Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.8039104Z 2025-05-07T20:31:46.8039204Z moe/activation_test.py:117: 2025-05-07T20:31:46.8039329Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8039430Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.8039533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8040028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.8040132Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.8040485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.8040703Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.8041057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.8041153Z kernel = self.compile( 2025-05-07T20:31:46.8041531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.8041708Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.8041834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8041838Z 2025-05-07T20:31:46.8042047Z self = 2025-05-07T20:31:46.8042818Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8043325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa55e10>} 2025-05-07T20:31:46.8044081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8044273Z context = 2025-05-07T20:31:46.8044278Z 2025-05-07T20:31:46.8044446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8044710Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8044821Z module_map=module_map) 2025-05-07T20:31:46.8044981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8045077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8045155Z E ^ 2025-05-07T20:31:46.8045507Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8045516Z 2025-05-07T20:31:46.8045925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8045934Z 2025-05-07T20:31:46.8046037Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8046256Z self=, 2025-05-07T20:31:46.8046335Z T=2048, 2025-05-07T20:31:46.8046409Z D=5120, 2025-05-07T20:31:46.8046488Z scale_ub=None, 2025-05-07T20:31:46.8046574Z contiguous=True, 2025-05-07T20:31:46.8046655Z compiled=False, 2025-05-07T20:31:46.8046727Z ) 2025-05-07T20:31:46.8046942Z self = 2025-05-07T20:31:46.8047113Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8047202Z 2025-05-07T20:31:46.8047280Z @given( 2025-05-07T20:31:46.8047399Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8047570Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8047685Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8047800Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8047913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8047992Z ) 2025-05-07T20:31:46.8048231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8048324Z def test_silu_mul_quant( 2025-05-07T20:31:46.8048402Z self, 2025-05-07T20:31:46.8048477Z T: int, 2025-05-07T20:31:46.8048550Z D: int, 2025-05-07T20:31:46.8048651Z scale_ub: Optional[float], 2025-05-07T20:31:46.8048739Z contiguous: bool, 2025-05-07T20:31:46.8048825Z compiled: bool, 2025-05-07T20:31:46.8048901Z ) -> None: 2025-05-07T20:31:46.8049000Z torch.manual_seed(2025) 2025-05-07T20:31:46.8049073Z 2025-05-07T20:31:46.8049246Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8049320Z 2025-05-07T20:31:46.8049414Z > x_sign = torch.sign(x) 2025-05-07T20:31:46.8051175Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
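Every example that survives allocation then hits the same Triton CompilationError seen above: fp8e4nv (the e4m3 format behind torch.float8_e4m3fn) is not available on this runner's GPU, a g5 instance's A10G at compute capability 8.6, and Triton only offers fp8e4b15/fp8e5 there. A hedged skip guard one could add, assuming pytest-style markers; the (8, 9) threshold reflects Ada/Hopper support for e4m3:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) kernels need Ada (SM 8.9) or Hopper (SM 9.0) and newer.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8 = pytest.mark.skipif(
        not supports_fp8e4nv(), reason="Triton fp8e4nv unsupported on this GPU"
    )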
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8051181Z 2025-05-07T20:31:46.8051302Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:46.8051307Z 2025-05-07T20:31:46.8051413Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8051630Z self=, 2025-05-07T20:31:46.8051712Z T=16384, 2025-05-07T20:31:46.8051789Z D=5120, 2025-05-07T20:31:46.8051878Z scale_ub=None, 2025-05-07T20:31:46.8051961Z contiguous=True, 2025-05-07T20:31:46.8052043Z compiled=False, 2025-05-07T20:31:46.8052116Z ) 2025-05-07T20:31:46.8052327Z self = 2025-05-07T20:31:46.8052503Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8052507Z 2025-05-07T20:31:46.8052583Z @given( 2025-05-07T20:31:46.8052700Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8052801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8052911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8053028Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8053147Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8053220Z ) 2025-05-07T20:31:46.8053474Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8053567Z def test_silu_mul_quant( 2025-05-07T20:31:46.8053642Z self, 2025-05-07T20:31:46.8053718Z T: int, 2025-05-07T20:31:46.8053792Z D: int, 2025-05-07T20:31:46.8053889Z scale_ub: Optional[float], 2025-05-07T20:31:46.8053978Z contiguous: bool, 2025-05-07T20:31:46.8054062Z compiled: bool, 2025-05-07T20:31:46.8054138Z ) -> None: 2025-05-07T20:31:46.8054239Z torch.manual_seed(2025) 2025-05-07T20:31:46.8054312Z 2025-05-07T20:31:46.8054478Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8056333Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
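For readers without the FBGEMM source at hand: the op under test composes a SiLU-gated multiply with fp8 quantization. The sketch below is only an assumed eager-mode equivalent, inferred from the call signature op(x0, x1, scale_ub_tensor) and the (y_fp8, y_scale) return; the actual kernel lives in fbgemm_gpu/experimental/gen_ai/moe/activation.py and may differ in details such as scale granularity.

    import torch

    def silu_mul_quant_reference(x0, x1, scale_ub=None):
        # Assumed semantics: y = silu(x0) * x1, then rowwise e4m3 quantization.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # scale_ub is a [1] tensor
        scale = row_max / torch.finfo(torch.float8_e4m3fn).max  # e4m3 max = 448.0
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale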
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8056487Z 2025-05-07T20:31:46.8056606Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8056613Z 2025-05-07T20:31:46.8056714Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8056929Z self=, 2025-05-07T20:31:46.8057007Z T=4096, 2025-05-07T20:31:46.8057082Z D=5120, 2025-05-07T20:31:46.8057167Z scale_ub=None, 2025-05-07T20:31:46.8057254Z contiguous=True, 2025-05-07T20:31:46.8057336Z compiled=False, 2025-05-07T20:31:46.8057407Z ) 2025-05-07T20:31:46.8057618Z self = 2025-05-07T20:31:46.8057795Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8057806Z 2025-05-07T20:31:46.8057885Z @given( 2025-05-07T20:31:46.8058002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8058098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8058216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8058331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8058444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8058522Z ) 2025-05-07T20:31:46.8058765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8058857Z def test_silu_mul_quant( 2025-05-07T20:31:46.8058936Z self, 2025-05-07T20:31:46.8059010Z T: int, 2025-05-07T20:31:46.8059084Z D: int, 2025-05-07T20:31:46.8059184Z scale_ub: Optional[float], 2025-05-07T20:31:46.8059277Z contiguous: bool, 2025-05-07T20:31:46.8059366Z compiled: bool, 2025-05-07T20:31:46.8059443Z ) -> None: 2025-05-07T20:31:46.8059541Z torch.manual_seed(2025) 2025-05-07T20:31:46.8059620Z 2025-05-07T20:31:46.8059783Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8061591Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8061599Z 2025-05-07T20:31:46.8061718Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8061728Z 2025-05-07T20:31:46.8061830Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8062054Z self=, 2025-05-07T20:31:46.8062131Z T=2048, 2025-05-07T20:31:46.8062206Z D=5120, 2025-05-07T20:31:46.8062294Z scale_ub=None, 2025-05-07T20:31:46.8062380Z contiguous=False, 2025-05-07T20:31:46.8062462Z compiled=False, 2025-05-07T20:31:46.8062538Z ) 2025-05-07T20:31:46.8062748Z self = 2025-05-07T20:31:46.8062919Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.8062924Z 2025-05-07T20:31:46.8062999Z @given( 2025-05-07T20:31:46.8063114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8063212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8063325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8063552Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8063668Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8063814Z ) 2025-05-07T20:31:46.8064064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8064157Z def test_silu_mul_quant( 2025-05-07T20:31:46.8064231Z self, 2025-05-07T20:31:46.8064306Z T: int, 2025-05-07T20:31:46.8064379Z D: int, 2025-05-07T20:31:46.8064476Z scale_ub: Optional[float], 2025-05-07T20:31:46.8064566Z contiguous: bool, 2025-05-07T20:31:46.8064650Z compiled: bool, 2025-05-07T20:31:46.8064725Z ) -> None: 2025-05-07T20:31:46.8064821Z torch.manual_seed(2025) 2025-05-07T20:31:46.8064894Z 2025-05-07T20:31:46.8065059Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8066871Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
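The contiguous parameter in these examples matters because x0 = x[:, :D] and x1 = x[:, D:] are strided views into one buffer, while .contiguous() materializes dense copies, so the kernel is exercised with both layouts. A small standalone check of that behavior:

    import torch

    x = torch.randn(4, 8)
    x0 = x[:, :4]                           # column slice: a view with strides (8, 1)
    print(x0.is_contiguous())               # False
    print(x0.contiguous().is_contiguous())  # True, backed by a fresh dense copy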
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8066882Z 2025-05-07T20:31:46.8067000Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8067007Z 2025-05-07T20:31:46.8067111Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8067330Z self=, 2025-05-07T20:31:46.8067410Z T=4096, 2025-05-07T20:31:46.8067485Z D=7168, 2025-05-07T20:31:46.8067570Z scale_ub=None, 2025-05-07T20:31:46.8067654Z contiguous=True, 2025-05-07T20:31:46.8067736Z compiled=True, 2025-05-07T20:31:46.8067813Z ) 2025-05-07T20:31:46.8068029Z self = 2025-05-07T20:31:46.8068201Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.8068206Z 2025-05-07T20:31:46.8068280Z @given( 2025-05-07T20:31:46.8068396Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8068490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8068603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8068716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8068826Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8068901Z ) 2025-05-07T20:31:46.8069147Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8069239Z def test_silu_mul_quant( 2025-05-07T20:31:46.8069319Z self, 2025-05-07T20:31:46.8069393Z T: int, 2025-05-07T20:31:46.8069467Z D: int, 2025-05-07T20:31:46.8069571Z scale_ub: Optional[float], 2025-05-07T20:31:46.8069659Z contiguous: bool, 2025-05-07T20:31:46.8069750Z compiled: bool, 2025-05-07T20:31:46.8069826Z ) -> None: 2025-05-07T20:31:46.8069919Z torch.manual_seed(2025) 2025-05-07T20:31:46.8069995Z 2025-05-07T20:31:46.8070158Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8071915Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8072010Z 2025-05-07T20:31:46.8072128Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8072132Z 2025-05-07T20:31:46.8072306Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8072531Z self=, 2025-05-07T20:31:46.8072606Z T=2048, 2025-05-07T20:31:46.8072682Z D=5120, 2025-05-07T20:31:46.8072767Z scale_ub=1200.0, 2025-05-07T20:31:46.8072850Z contiguous=False, 2025-05-07T20:31:46.8072932Z compiled=False, 2025-05-07T20:31:46.8073008Z ) 2025-05-07T20:31:46.8073219Z self = 2025-05-07T20:31:46.8073393Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.8073398Z 2025-05-07T20:31:46.8073473Z @given( 2025-05-07T20:31:46.8073588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8073686Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8073805Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8073919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8074039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8074112Z ) 2025-05-07T20:31:46.8074361Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8074454Z def test_silu_mul_quant( 2025-05-07T20:31:46.8074527Z self, 2025-05-07T20:31:46.8074603Z T: int, 2025-05-07T20:31:46.8074677Z D: int, 2025-05-07T20:31:46.8074775Z scale_ub: Optional[float], 2025-05-07T20:31:46.8074867Z contiguous: bool, 2025-05-07T20:31:46.8074951Z compiled: bool, 2025-05-07T20:31:46.8075028Z ) -> None: 2025-05-07T20:31:46.8075125Z torch.manual_seed(2025) 2025-05-07T20:31:46.8075196Z 2025-05-07T20:31:46.8075358Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8077120Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
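The figures quoted in these messages (device free memory, bytes allocated by PyTorch, reserved-but-unallocated cache) can be reproduced directly with the standard CUDA memory introspection calls; a small sketch:

    import torch

    free, total = torch.cuda.mem_get_info()    # device-wide free/total bytes
    allocated = torch.cuda.memory_allocated()  # live tensor bytes held by PyTorch
    reserved = torch.cuda.memory_reserved()    # bytes cached by the allocator
    print(f"free {free / 2**20:.2f} MiB of {total / 2**30:.2f} GiB total")
    print(f"allocated {allocated / 2**30:.2f} GiB, reserved-unallocated "
          f"{(reserved - allocated) / 2**20:.2f} MiB")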
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8077131Z 2025-05-07T20:31:46.8077247Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8077254Z 2025-05-07T20:31:46.8077354Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8077571Z self=, 2025-05-07T20:31:46.8077650Z T=4096, 2025-05-07T20:31:46.8077724Z D=7168, 2025-05-07T20:31:46.8077805Z scale_ub=1200.0, 2025-05-07T20:31:46.8077896Z contiguous=True, 2025-05-07T20:31:46.8077978Z compiled=False, 2025-05-07T20:31:46.8078049Z ) 2025-05-07T20:31:46.8078271Z self = 2025-05-07T20:31:46.8078442Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.8078446Z 2025-05-07T20:31:46.8078524Z @given( 2025-05-07T20:31:46.8078642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8078739Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8078853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8078966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8079078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8079154Z ) 2025-05-07T20:31:46.8079398Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8079490Z def test_silu_mul_quant( 2025-05-07T20:31:46.8079566Z self, 2025-05-07T20:31:46.8079726Z T: int, 2025-05-07T20:31:46.8079801Z D: int, 2025-05-07T20:31:46.8079901Z scale_ub: Optional[float], 2025-05-07T20:31:46.8080064Z contiguous: bool, 2025-05-07T20:31:46.8080158Z compiled: bool, 2025-05-07T20:31:46.8080234Z ) -> None: 2025-05-07T20:31:46.8080326Z torch.manual_seed(2025) 2025-05-07T20:31:46.8080401Z 2025-05-07T20:31:46.8080566Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8082318Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8082331Z 2025-05-07T20:31:46.8082452Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8082457Z 2025-05-07T20:31:46.8082557Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8082778Z self=, 2025-05-07T20:31:46.8082854Z T=16384, 2025-05-07T20:31:46.8082930Z D=7168, 2025-05-07T20:31:46.8083020Z scale_ub=None, 2025-05-07T20:31:46.8083105Z contiguous=False, 2025-05-07T20:31:46.8083187Z compiled=True, 2025-05-07T20:31:46.8083261Z ) 2025-05-07T20:31:46.8083471Z self = 2025-05-07T20:31:46.8083652Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.8083657Z 2025-05-07T20:31:46.8083732Z @given( 2025-05-07T20:31:46.8083848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8083954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8084064Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8084183Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8084297Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8084373Z ) 2025-05-07T20:31:46.8084622Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8084714Z def test_silu_mul_quant( 2025-05-07T20:31:46.8084789Z self, 2025-05-07T20:31:46.8084868Z T: int, 2025-05-07T20:31:46.8084941Z D: int, 2025-05-07T20:31:46.8085038Z scale_ub: Optional[float], 2025-05-07T20:31:46.8085129Z contiguous: bool, 2025-05-07T20:31:46.8085214Z compiled: bool, 2025-05-07T20:31:46.8085290Z ) -> None: 2025-05-07T20:31:46.8085388Z torch.manual_seed(2025) 2025-05-07T20:31:46.8085460Z 2025-05-07T20:31:46.8085624Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8087405Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8087410Z 2025-05-07T20:31:46.8087528Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8087535Z 2025-05-07T20:31:46.8087636Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8087853Z self=, 2025-05-07T20:31:46.8087932Z T=4096, 2025-05-07T20:31:46.8088007Z D=7168, 2025-05-07T20:31:46.8088173Z scale_ub=None, 2025-05-07T20:31:46.8088260Z contiguous=True, 2025-05-07T20:31:46.8088347Z compiled=False, 2025-05-07T20:31:46.8088518Z ) 2025-05-07T20:31:46.8088733Z self = 2025-05-07T20:31:46.8088903Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8088907Z 2025-05-07T20:31:46.8088983Z @given( 2025-05-07T20:31:46.8089103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8089200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8089312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8089425Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8089536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8089613Z ) 2025-05-07T20:31:46.8090121Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8090247Z def test_silu_mul_quant( 2025-05-07T20:31:46.8090330Z self, 2025-05-07T20:31:46.8090404Z T: int, 2025-05-07T20:31:46.8090490Z D: int, 2025-05-07T20:31:46.8090588Z scale_ub: Optional[float], 2025-05-07T20:31:46.8090675Z contiguous: bool, 2025-05-07T20:31:46.8094875Z compiled: bool, 2025-05-07T20:31:46.8094973Z ) -> None: 2025-05-07T20:31:46.8095074Z torch.manual_seed(2025) 2025-05-07T20:31:46.8095152Z 2025-05-07T20:31:46.8095333Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8097121Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8097133Z 2025-05-07T20:31:46.8097254Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8097258Z 2025-05-07T20:31:46.8097365Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8097589Z self=, 2025-05-07T20:31:46.8097666Z T=16384, 2025-05-07T20:31:46.8097745Z D=7168, 2025-05-07T20:31:46.8097826Z scale_ub=None, 2025-05-07T20:31:46.8097909Z contiguous=True, 2025-05-07T20:31:46.8097995Z compiled=False, 2025-05-07T20:31:46.8098068Z ) 2025-05-07T20:31:46.8098282Z self = 2025-05-07T20:31:46.8098457Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8098462Z 2025-05-07T20:31:46.8098537Z @given( 2025-05-07T20:31:46.8098662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8098757Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8098876Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8099000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8099112Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8099187Z ) 2025-05-07T20:31:46.8099432Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8099525Z def test_silu_mul_quant( 2025-05-07T20:31:46.8099599Z self, 2025-05-07T20:31:46.8099677Z T: int, 2025-05-07T20:31:46.8099751Z D: int, 2025-05-07T20:31:46.8099917Z scale_ub: Optional[float], 2025-05-07T20:31:46.8100012Z contiguous: bool, 2025-05-07T20:31:46.8100097Z compiled: bool, 2025-05-07T20:31:46.8100180Z ) -> None: 2025-05-07T20:31:46.8100277Z torch.manual_seed(2025) 2025-05-07T20:31:46.8100517Z 2025-05-07T20:31:46.8100700Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8102574Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8102581Z 2025-05-07T20:31:46.8102704Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8102708Z 2025-05-07T20:31:46.8102814Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8103032Z self=, 2025-05-07T20:31:46.8103117Z T=16384, 2025-05-07T20:31:46.8103191Z D=7168, 2025-05-07T20:31:46.8103273Z scale_ub=1200.0, 2025-05-07T20:31:46.8103365Z contiguous=True, 2025-05-07T20:31:46.8103448Z compiled=False, 2025-05-07T20:31:46.8103520Z ) 2025-05-07T20:31:46.8103735Z self = 2025-05-07T20:31:46.8103908Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.8103912Z 2025-05-07T20:31:46.8103989Z @given( 2025-05-07T20:31:46.8104104Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8104203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8104323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8104437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8104548Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8104624Z ) 2025-05-07T20:31:46.8104871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8104971Z def test_silu_mul_quant( 2025-05-07T20:31:46.8105050Z self, 2025-05-07T20:31:46.8105126Z T: int, 2025-05-07T20:31:46.8105204Z D: int, 2025-05-07T20:31:46.8105302Z scale_ub: Optional[float], 2025-05-07T20:31:46.8105390Z contiguous: bool, 2025-05-07T20:31:46.8105482Z compiled: bool, 2025-05-07T20:31:46.8105560Z ) -> None: 2025-05-07T20:31:46.8105654Z torch.manual_seed(2025) 2025-05-07T20:31:46.8105734Z 2025-05-07T20:31:46.8105906Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8107678Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
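Note that even the largest request here (448 MiB for T=16384, D=7168) fails with the device already down to ~30 MiB free, i.e., the GPU was effectively full before the example started. A hypothetical fail-fast precondition, not part of this suite, that would surface that state more directly than a cascade of per-example OOMs:

    import torch

    def assert_cuda_headroom(min_free_mib: int = 1024) -> None:
        # Hypothetical guard: abort early when the device is already nearly full.
        free, _ = torch.cuda.mem_get_info()
        if free < min_free_mib * 2**20:
            raise RuntimeError(
                f"cuda:0 has only {free / 2**20:.0f} MiB free; "
                f"need at least {min_free_mib} MiB"
            )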
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8107687Z 2025-05-07T20:31:46.8107804Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8107809Z 2025-05-07T20:31:46.8107911Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8108132Z self=, 2025-05-07T20:31:46.8108209Z T=128, 2025-05-07T20:31:46.8108290Z D=5120, 2025-05-07T20:31:46.8108371Z scale_ub=1200.0, 2025-05-07T20:31:46.8108455Z contiguous=False, 2025-05-07T20:31:46.8108540Z compiled=False, 2025-05-07T20:31:46.8108612Z ) 2025-05-07T20:31:46.8108825Z self = 2025-05-07T20:31:46.8109000Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.8109086Z 2025-05-07T20:31:46.8109162Z @given( 2025-05-07T20:31:46.8109285Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8109454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8109568Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8109684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8109799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8109872Z ) 2025-05-07T20:31:46.8110115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8110209Z def test_silu_mul_quant( 2025-05-07T20:31:46.8110283Z self, 2025-05-07T20:31:46.8110361Z T: int, 2025-05-07T20:31:46.8110437Z D: int, 2025-05-07T20:31:46.8110536Z scale_ub: Optional[float], 2025-05-07T20:31:46.8110627Z contiguous: bool, 2025-05-07T20:31:46.8110711Z compiled: bool, 2025-05-07T20:31:46.8110790Z ) -> None: 2025-05-07T20:31:46.8110891Z torch.manual_seed(2025) 2025-05-07T20:31:46.8110962Z 2025-05-07T20:31:46.8111133Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8111208Z 2025-05-07T20:31:46.8111301Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8111428Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8111517Z x = x_sign * x_clamp 2025-05-07T20:31:46.8111596Z x0 = x[:, :D] 2025-05-07T20:31:46.8111677Z x1 = x[:, D:] 2025-05-07T20:31:46.8111748Z 2025-05-07T20:31:46.8111831Z if contiguous: 2025-05-07T20:31:46.8111926Z x0 = x0.contiguous() 2025-05-07T20:31:46.8112014Z x1 = x1.contiguous() 2025-05-07T20:31:46.8112086Z 2025-05-07T20:31:46.8112181Z if scale_ub is not None: 2025-05-07T20:31:46.8112284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.8112422Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.8112499Z ) 2025-05-07T20:31:46.8112572Z else: 2025-05-07T20:31:46.8112668Z scale_ub_tensor = None 2025-05-07T20:31:46.8112748Z 2025-05-07T20:31:46.8112874Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.8112967Z op = silu_mul_quant 2025-05-07T20:31:46.8113051Z if compiled: 2025-05-07T20:31:46.8113150Z op = torch.compile(op) 2025-05-07T20:31:46.8113256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8113327Z 2025-05-07T20:31:46.8113415Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.8113422Z 2025-05-07T20:31:46.8113515Z moe/activation_test.py:117: 2025-05-07T20:31:46.8113639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8113741Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.8113837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8114332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.8114436Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.8114793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.8115021Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.8115358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.8115451Z kernel = self.compile( 2025-05-07T20:31:46.8115832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.8116005Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.8116131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8116136Z 2025-05-07T20:31:46.8116343Z self = 2025-05-07T20:31:46.8117335Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8117844Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54125cf0>} 2025-05-07T20:31:46.8118585Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8118770Z context = 2025-05-07T20:31:46.8118775Z 2025-05-07T20:31:46.8118939Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8119196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8119310Z module_map=module_map) 2025-05-07T20:31:46.8119473Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8119568Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8119644Z E ^ 2025-05-07T20:31:46.8119996Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8120001Z 2025-05-07T20:31:46.8120408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8120413Z 2025-05-07T20:31:46.8120518Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8120734Z self=, 2025-05-07T20:31:46.8120807Z T=2048, 2025-05-07T20:31:46.8120879Z D=7168, 2025-05-07T20:31:46.8120956Z scale_ub=None, 2025-05-07T20:31:46.8121044Z contiguous=False, 2025-05-07T20:31:46.8121126Z compiled=False, 2025-05-07T20:31:46.8121197Z ) 2025-05-07T20:31:46.8121413Z self = 2025-05-07T20:31:46.8121587Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.8121592Z 2025-05-07T20:31:46.8121662Z @given( 2025-05-07T20:31:46.8121779Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8121874Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8121986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8122099Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8122208Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8122277Z ) 2025-05-07T20:31:46.8122520Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8122612Z def test_silu_mul_quant( 2025-05-07T20:31:46.8122693Z self, 2025-05-07T20:31:46.8122771Z T: int, 2025-05-07T20:31:46.8122840Z D: int, 2025-05-07T20:31:46.8122943Z scale_ub: Optional[float], 2025-05-07T20:31:46.8123028Z contiguous: bool, 2025-05-07T20:31:46.8123113Z compiled: bool, 2025-05-07T20:31:46.8123191Z ) -> None: 2025-05-07T20:31:46.8123285Z torch.manual_seed(2025) 2025-05-07T20:31:46.8123354Z 2025-05-07T20:31:46.8123522Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8125284Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8125371Z 2025-05-07T20:31:46.8125586Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8125591Z 2025-05-07T20:31:46.8125692Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8125909Z self=, 2025-05-07T20:31:46.8125989Z T=128, 2025-05-07T20:31:46.8126062Z D=7168, 2025-05-07T20:31:46.8126144Z scale_ub=1200.0, 2025-05-07T20:31:46.8126225Z contiguous=True, 2025-05-07T20:31:46.8126301Z compiled=True, 2025-05-07T20:31:46.8126372Z ) 2025-05-07T20:31:46.8126585Z self = 2025-05-07T20:31:46.8126753Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.8126758Z 2025-05-07T20:31:46.8126835Z @given( 2025-05-07T20:31:46.8126949Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8127050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8127168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8127281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8127394Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8127464Z ) 2025-05-07T20:31:46.8127705Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8127797Z def test_silu_mul_quant( 2025-05-07T20:31:46.8127870Z self, 2025-05-07T20:31:46.8127941Z T: int, 2025-05-07T20:31:46.8128015Z D: int, 2025-05-07T20:31:46.8128111Z scale_ub: Optional[float], 2025-05-07T20:31:46.8128198Z contiguous: bool, 2025-05-07T20:31:46.8128284Z compiled: bool, 2025-05-07T20:31:46.8128359Z ) -> None: 2025-05-07T20:31:46.8128449Z torch.manual_seed(2025) 2025-05-07T20:31:46.8128525Z 2025-05-07T20:31:46.8128688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8128769Z 2025-05-07T20:31:46.8128856Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8128982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8129073Z x = x_sign * x_clamp 2025-05-07T20:31:46.8129148Z x0 = x[:, :D] 2025-05-07T20:31:46.8129225Z x1 = x[:, D:] 2025-05-07T20:31:46.8129297Z 2025-05-07T20:31:46.8129375Z if contiguous: 2025-05-07T20:31:46.8129464Z x0 = x0.contiguous() 2025-05-07T20:31:46.8129553Z x1 = x1.contiguous() 2025-05-07T20:31:46.8129621Z 2025-05-07T20:31:46.8129710Z if scale_ub is not None: 2025-05-07T20:31:46.8129814Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.8129945Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.8130023Z ) 2025-05-07T20:31:46.8130099Z else: 2025-05-07T20:31:46.8130191Z scale_ub_tensor = None 2025-05-07T20:31:46.8130270Z 2025-05-07T20:31:46.8130396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.8130487Z op = silu_mul_quant 2025-05-07T20:31:46.8130571Z if compiled: 2025-05-07T20:31:46.8130666Z op = torch.compile(op) 2025-05-07T20:31:46.8130767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8130839Z 2025-05-07T20:31:46.8130926Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.8130931Z 2025-05-07T20:31:46.8131025Z moe/activation_test.py:117: 2025-05-07T20:31:46.8131154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8131253Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.8131353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8131718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.8131807Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.8132392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.8132562Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.8132924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.8133145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.8133480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.8133575Z kernel = self.compile( 2025-05-07T20:31:46.8133951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.8134122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.8134249Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8134260Z 2025-05-07T20:31:46.8134462Z self = 2025-05-07T20:31:46.8135236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8135727Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c541270a0>} 2025-05-07T20:31:46.8136466Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8136684Z context = 2025-05-07T20:31:46.8136690Z 2025-05-07T20:31:46.8136870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8137138Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8137241Z module_map=module_map) 2025-05-07T20:31:46.8137399Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8137499Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8137574Z E ^ 2025-05-07T20:31:46.8137926Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8137931Z 2025-05-07T20:31:46.8138338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8138343Z 2025-05-07T20:31:46.8138446Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8138666Z self=, 2025-05-07T20:31:46.8138744Z T=128, 2025-05-07T20:31:46.8138817Z D=7168, 2025-05-07T20:31:46.8138898Z scale_ub=1200.0, 2025-05-07T20:31:46.8138977Z contiguous=True, 2025-05-07T20:31:46.8139062Z compiled=False, 2025-05-07T20:31:46.8139130Z ) 2025-05-07T20:31:46.8139341Z self = 2025-05-07T20:31:46.8139511Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.8139516Z 2025-05-07T20:31:46.8139586Z @given( 2025-05-07T20:31:46.8139700Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8139800Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8140001Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8140117Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8140231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8140304Z ) 2025-05-07T20:31:46.8140546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8140724Z def test_silu_mul_quant( 2025-05-07T20:31:46.8140797Z self, 2025-05-07T20:31:46.8140872Z T: int, 2025-05-07T20:31:46.8141018Z D: int, 2025-05-07T20:31:46.8141117Z scale_ub: Optional[float], 2025-05-07T20:31:46.8141206Z contiguous: bool, 2025-05-07T20:31:46.8141287Z compiled: bool, 2025-05-07T20:31:46.8141359Z ) -> None: 2025-05-07T20:31:46.8141456Z torch.manual_seed(2025) 2025-05-07T20:31:46.8141526Z 2025-05-07T20:31:46.8141693Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8141770Z 2025-05-07T20:31:46.8141861Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8141982Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8143743Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8143754Z 2025-05-07T20:31:46.8143871Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:46.8143876Z 2025-05-07T20:31:46.8143976Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8144194Z self=, 2025-05-07T20:31:46.8144271Z T=128, 2025-05-07T20:31:46.8144346Z D=5120, 2025-05-07T20:31:46.8144424Z scale_ub=1200.0, 2025-05-07T20:31:46.8144510Z contiguous=True, 2025-05-07T20:31:46.8144589Z compiled=True, 2025-05-07T20:31:46.8144659Z ) 2025-05-07T20:31:46.8144872Z self = 2025-05-07T20:31:46.8145041Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.8145046Z 2025-05-07T20:31:46.8145129Z @given( 2025-05-07T20:31:46.8145243Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8145336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8145450Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8145563Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8145673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8145744Z ) 2025-05-07T20:31:46.8145987Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8146077Z def test_silu_mul_quant( 2025-05-07T20:31:46.8146152Z self, 2025-05-07T20:31:46.8146222Z T: int, 2025-05-07T20:31:46.8146295Z D: int, 2025-05-07T20:31:46.8146393Z scale_ub: Optional[float], 2025-05-07T20:31:46.8146478Z contiguous: bool, 2025-05-07T20:31:46.8146566Z compiled: bool, 2025-05-07T20:31:46.8146638Z ) -> None: 2025-05-07T20:31:46.8146733Z torch.manual_seed(2025) 2025-05-07T20:31:46.8146808Z 2025-05-07T20:31:46.8146969Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8147037Z 2025-05-07T20:31:46.8147128Z > x_sign = torch.sign(x) 2025-05-07T20:31:46.8148879Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8148971Z
2025-05-07T20:31:46.8149092Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:31:46.8149096Z
2025-05-07T20:31:46.8149269Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:46.8149489Z self=,
2025-05-07T20:31:46.8149559Z T=128,
2025-05-07T20:31:46.8149631Z D=7168,
2025-05-07T20:31:46.8149711Z scale_ub=None,
2025-05-07T20:31:46.8149790Z contiguous=True,
2025-05-07T20:31:46.8149867Z compiled=True,
2025-05-07T20:31:46.8149939Z )
2025-05-07T20:31:46.8150148Z self =
2025-05-07T20:31:46.8150310Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:46.8150315Z
2025-05-07T20:31:46.8150388Z @given(
2025-05-07T20:31:46.8150500Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:46.8150595Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:46.8150705Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:46.8150824Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:46.8150943Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:46.8151011Z )
2025-05-07T20:31:46.8151256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:46.8151349Z def test_silu_mul_quant(
2025-05-07T20:31:46.8151424Z self,
2025-05-07T20:31:46.8151494Z T: int,
2025-05-07T20:31:46.8151571Z D: int,
2025-05-07T20:31:46.8151667Z scale_ub: Optional[float],
2025-05-07T20:31:46.8151752Z contiguous: bool,
2025-05-07T20:31:46.8151835Z compiled: bool,
2025-05-07T20:31:46.8151911Z ) -> None:
2025-05-07T20:31:46.8152001Z torch.manual_seed(2025)
2025-05-07T20:31:46.8152075Z
2025-05-07T20:31:46.8152241Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:46.8153995Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:46.8154005Z
2025-05-07T20:31:46.8154119Z moe/activation_test.py:92: OutOfMemoryError
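The OutOfMemoryError examples above are a knock-on effect rather than independent bugs: earlier examples allocate activations as large as 16384 x 14336 in bfloat16 (roughly 0.44 GiB per tensor before intermediates), and by the later T=128 cases only a few MiB of the A10G's 22.07 GiB remain. A minimal mitigation sketch follows, assuming the test module may be edited locally; release_gpu_memory is a hypothetical helper, not part of activation_test.py, and the allocator hint quoted in the error text only takes effect if set before CUDA is first initialized.

```python
# Hypothetical OOM mitigation sketch; not part of activation_test.py.
import gc
import os

# Allocator hint quoted in the OutOfMemoryError text above; it must be set
# before torch initializes CUDA, so set it before `import torch` runs.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_gpu_memory() -> None:
    """Drop Python references and cached allocator blocks between
    hypothesis examples, so one 16384 x 14336 bf16 example does not
    starve the next one."""
    gc.collect()
    torch.cuda.empty_cache()
```

Calling release_gpu_memory() at the top of test_silu_mul_quant (or from a setUp hook) trades some allocator churn for headroom; it would not touch the CompilationError cases, which fail before any large allocation.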
See " 2025-05-07T20:31:46.8156261Z 2025-05-07T20:31:46.8156437Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:46.8157700Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:46.8157991Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:46.8158067Z 2025-05-07T20:31:46.8158278Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:46.8158443Z ================== 1 failed, 1 passed, 13 warnings in 29.73s =================== 2025-05-07T20:31:48.5859610Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:48.6487870Z 2025-05-07T20:31:48.6488920Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:48.6489303Z 2025-05-07T20:31:48.6489309Z 2025-05-07T20:31:48.6510143Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:50.7966320Z ============================= test session starts ============================== 2025-05-07T20:31:50.7967326Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:50.7967933Z cachedir: .pytest_cache 2025-05-07T20:31:50.7968515Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:50.7969240Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:50.7969652Z plugins: hypothesis-6.131.14 2025-05-07T20:31:52.4050848Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:52.5818396Z collecting ... 
2025-05-07T20:31:48.5859610Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error)
2025-05-07T20:31:48.6487870Z
2025-05-07T20:31:48.6488920Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py
2025-05-07T20:31:48.6489303Z
2025-05-07T20:31:48.6489309Z
2025-05-07T20:31:48.6510143Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py
2025-05-07T20:31:50.7966320Z ============================= test session starts ==============================
2025-05-07T20:31:50.7967326Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:50.7967933Z cachedir: .pytest_cache
2025-05-07T20:31:50.7968515Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:50.7969240Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:50.7969652Z plugins: hypothesis-6.131.14
2025-05-07T20:31:52.4050848Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:31:52.5818396Z collecting ... collected 2 items / 1 deselected / 1 selected
2025-05-07T20:31:52.5819205Z run-last-failure: rerun previous 1 failure
2025-05-07T20:31:52.5819647Z
2025-05-07T20:31:54.7049004Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:54.7050148Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last):
2025-05-07T20:31:54.7051480Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:54.7053008Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:54.7054380Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:54.7055776Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:54.7057083Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:54.7058444Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:54.7059927Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:54.7061577Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse())
2025-05-07T20:31:54.7062947Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:54.7064174Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node)
2025-05-07T20:31:54.7065211Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
2025-05-07T20:31:54.7066220Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node)
2025-05-07T20:31:54.7067439Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:54.7068734Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:54.7069844Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
2025-05-07T20:31:54.7070870Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item)
2025-05-07T20:31:54.7072039Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:54.7073394Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:54.7074455Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:54.7075369Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:54.7076099Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^
2025-05-07T20:31:54.7077106Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.2924629Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.2925310Z self=, 2025-05-07T20:31:55.2926068Z T=1, 2025-05-07T20:31:55.2926260Z D=5120, 2025-05-07T20:31:55.2926600Z scale_ub=None, 2025-05-07T20:31:55.2926814Z contiguous=True, 2025-05-07T20:31:55.2927040Z compiled=True, 2025-05-07T20:31:55.2927252Z ) 2025-05-07T20:31:55.2927569Z self = 2025-05-07T20:31:55.2928063Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:55.2928321Z 2025-05-07T20:31:55.2928405Z @given( 2025-05-07T20:31:55.2928640Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.2928957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.2929265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.2929599Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.2929921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.2930210Z ) 2025-05-07T20:31:55.2930572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.2931004Z def test_silu_mul_quant( 2025-05-07T20:31:55.2931255Z self, 2025-05-07T20:31:55.2931454Z T: int, 2025-05-07T20:31:55.2931650Z D: int, 2025-05-07T20:31:55.2931875Z scale_ub: Optional[float], 2025-05-07T20:31:55.2932152Z contiguous: bool, 2025-05-07T20:31:55.2932387Z compiled: bool, 2025-05-07T20:31:55.2932617Z ) -> None: 2025-05-07T20:31:55.2932840Z torch.manual_seed(2025) 2025-05-07T20:31:55.2933076Z 2025-05-07T20:31:55.2933355Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.2933699Z 2025-05-07T20:31:55.2933894Z x_sign = torch.sign(x) 2025-05-07T20:31:55.2934183Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.2934495Z x = x_sign * x_clamp 2025-05-07T20:31:55.2934740Z x0 = x[:, :D] 2025-05-07T20:31:55.2934951Z x1 = x[:, D:] 2025-05-07T20:31:55.2935167Z 2025-05-07T20:31:55.2935357Z if contiguous: 2025-05-07T20:31:55.2935586Z x0 = x0.contiguous() 2025-05-07T20:31:55.2935849Z x1 = x1.contiguous() 2025-05-07T20:31:55.2936092Z 2025-05-07T20:31:55.2936284Z if scale_ub is not None: 2025-05-07T20:31:55.2936565Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.2936906Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.2937207Z ) 2025-05-07T20:31:55.2937405Z else: 2025-05-07T20:31:55.2937627Z scale_ub_tensor = None 2025-05-07T20:31:55.2937876Z 2025-05-07T20:31:55.2938113Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2938426Z op = silu_mul_quant 2025-05-07T20:31:55.2938682Z if compiled: 2025-05-07T20:31:55.2938929Z op = torch.compile(op) 2025-05-07T20:31:55.2939230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.2939515Z 2025-05-07T20:31:55.2939705Z y_fp8, y_scale = fn() 2025-05-07T20:31:55.2940089Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:55.2940398Z 2025-05-07T20:31:55.2940635Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2940974Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:55.2941275Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:55.2941586Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:55.2941944Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2942259Z 2025-05-07T20:31:55.2942457Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:55.2942660Z 2025-05-07T20:31:55.2942764Z moe/activation_test.py:126: 2025-05-07T20:31:55.2943062Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2943396Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:55.2943719Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2944738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:55.2945483Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:55.2946027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.2946699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.2947380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:55.2948101Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.2948842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:55.2949584Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.2950315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:55.2950951Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:55.2951541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:55.2952062Z fn() 2025-05-07T20:31:55.2952568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:55.2953150Z self.fn.run( 2025-05-07T20:31:55.2953610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.2954137Z kernel = self.compile( 2025-05-07T20:31:55.2954676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.2955325Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.2955724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2955955Z 2025-05-07T20:31:55.2956161Z self = 2025-05-07T20:31:55.2957235Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.2958640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07cfc6f400>} 2025-05-07T20:31:55.2959988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.2970086Z context = 2025-05-07T20:31:55.2970538Z 2025-05-07T20:31:55.2970730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.2971274Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.2971756Z module_map=module_map) 2025-05-07T20:31:55.2972139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.2972511Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:55.2972797Z E ^ 2025-05-07T20:31:55.2973275Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.2973734Z 2025-05-07T20:31:55.2974166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.2974689Z 2025-05-07T20:31:55.2974932Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.2975363Z self=, 2025-05-07T20:31:55.2975852Z T=2048, 2025-05-07T20:31:55.2976060Z D=5120, 2025-05-07T20:31:55.2976267Z scale_ub=1200.0, 2025-05-07T20:31:55.2976498Z contiguous=True, 2025-05-07T20:31:55.2976735Z compiled=False, 2025-05-07T20:31:55.2976961Z ) 2025-05-07T20:31:56.2176260Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:56.2177354Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:56.2178688Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:56.2180237Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:56.2181634Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:56.2183017Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.2184315Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:56.2185693Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.2187102Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:56.2188334Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:56.2189562Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:56.2190985Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:56.2192046Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:56.2193059Z W0507 20:31:56.213000 87525 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:56.2194279Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:56.2195563Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:56.2196680Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:56.2198245Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:56.2199421Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:56.2200779Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:56.2201843Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.2202760Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.2203516Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:56.2204534Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
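Since the 'ci' profile in the session header above runs with derandomize=True and print_blob=True, every "Trying example" here is deterministic. To debug a single case locally, the parameters of the example reported next (T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False) could be pinned with hypothesis's @example decorator; a sketch of such a hypothetical local edit, with the test body left unchanged:

```python
# Hypothetical local edit to ActivationTests.test_silu_mul_quant for
# isolating one failing case; mirrors parameters printed in this log.
from typing import Optional
from hypothesis import example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
@settings(max_examples=1, deadline=None)
def test_silu_mul_quant(
    self,
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    ...  # body as in activation_test.py; explicit examples always run
```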
2025-05-07T20:31:57.1806401Z self =
2025-05-07T20:31:57.1807112Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:57.1807397Z
2025-05-07T20:31:57.1807481Z @given(
2025-05-07T20:31:57.1807725Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:57.1808042Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:57.1808344Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:57.1808677Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:57.1809008Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:57.1809291Z )
2025-05-07T20:31:57.1809686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:57.1810139Z def test_silu_mul_quant(
2025-05-07T20:31:57.1810408Z self,
2025-05-07T20:31:57.1810619Z T: int,
2025-05-07T20:31:57.1810831Z D: int,
2025-05-07T20:31:57.1811068Z scale_ub: Optional[float],
2025-05-07T20:31:57.1811367Z contiguous: bool,
2025-05-07T20:31:57.1811619Z compiled: bool,
2025-05-07T20:31:57.1811857Z ) -> None:
2025-05-07T20:31:57.1812074Z torch.manual_seed(2025)
2025-05-07T20:31:57.1812323Z
2025-05-07T20:31:57.1812602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:57.1812942Z
2025-05-07T20:31:57.1813144Z x_sign = torch.sign(x)
2025-05-07T20:31:57.1813439Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:57.1813747Z x = x_sign * x_clamp
2025-05-07T20:31:57.1813994Z x0 = x[:, :D]
2025-05-07T20:31:57.1814215Z x1 = x[:, D:]
2025-05-07T20:31:57.1814820Z
2025-05-07T20:31:57.1815016Z if contiguous:
2025-05-07T20:31:57.1815263Z x0 = x0.contiguous()
2025-05-07T20:31:57.1815651Z x1 = x1.contiguous()
2025-05-07T20:31:57.1815905Z
2025-05-07T20:31:57.1816107Z if scale_ub is not None:
2025-05-07T20:31:57.1816381Z scale_ub_tensor = torch.tensor(
2025-05-07T20:31:57.1816723Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:57.1817038Z )
2025-05-07T20:31:57.1817240Z else:
2025-05-07T20:31:57.1817454Z scale_ub_tensor = None
2025-05-07T20:31:57.1817707Z
2025-05-07T20:31:57.1817944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:57.1818253Z op = silu_mul_quant
2025-05-07T20:31:57.1818507Z if compiled:
2025-05-07T20:31:57.1818758Z op = torch.compile(op)
2025-05-07T20:31:57.1819053Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.1819340Z
2025-05-07T20:31:57.1819540Z > y_fp8, y_scale = fn()
2025-05-07T20:31:57.1819707Z
2025-05-07T20:31:57.1819917Z moe/activation_test.py:117:
2025-05-07T20:31:57.1820224Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1820558Z moe/activation_test.py:115: in fn
2025-05-07T20:31:57.1820848Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.1821535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:57.1822225Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.1822764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:57.1823438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.1824101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.1824645Z kernel = self.compile(
2025-05-07T20:31:57.1825191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.1825838Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.1826237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1826463Z
2025-05-07T20:31:57.1826678Z self =
2025-05-07T20:31:57.1827753Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:57.1829120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07cf43eef0>}
2025-05-07T20:31:57.1830474Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:57.1831491Z context =
2025-05-07T20:31:57.1831780Z
2025-05-07T20:31:57.1831953Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:57.1832470Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:57.1832946Z module_map=module_map)
2025-05-07T20:31:57.1833317Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.1833674Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.1833939Z E ^
2025-05-07T20:31:57.1834414Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.1834947Z
2025-05-07T20:31:57.1835439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:57.1835947Z
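The example that follows fails in the reference path instead: ref_fn calls triton_quantize_fp8_row, whose _kernel_quantize_fp8_row kernel trips the same fp8e4nv limitation during autotuning. For intuition, here is a pure-PyTorch sketch of rowwise FP8 quantization consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]); the e4m3 maximum of 448 and the scale_ub clamping semantics are assumptions, not read from this log, and this is not FBGEMM's implementation.

```python
# Hedged reference sketch of rowwise FP8 quantization (assumed semantics
# of triton_quantize_fp8_row; not FBGEMM's actual kernel logic).
from typing import Optional, Tuple
import torch

FP8_E4M3_MAX = 448.0  # assumed finite max of torch.float8_e4m3fn

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # One scale per row, so each row maps into the representable range.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        # Assumption: scale_ub caps the per-row max before scaling.
        row_max = torch.minimum(row_max, scale_ub)
    scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale  # dequantize: y_fp8.float() * scale[:, None]
```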
2025-05-07T20:31:57.1836054Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:57.1836467Z self=,
2025-05-07T20:31:57.1836871Z T=2048,
2025-05-07T20:31:57.1837065Z D=5120,
2025-05-07T20:31:57.1837259Z scale_ub=1200.0,
2025-05-07T20:31:57.1837486Z contiguous=True,
2025-05-07T20:31:57.1837715Z compiled=True,
2025-05-07T20:31:57.1837922Z )
2025-05-07T20:31:57.1838247Z self =
2025-05-07T20:31:57.1838741Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:57.1839008Z
2025-05-07T20:31:57.1839086Z @given(
2025-05-07T20:31:57.1839324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:57.1839683Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:57.1839994Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:57.1840330Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:57.1840662Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:57.1840951Z )
2025-05-07T20:31:57.1841296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:57.1841740Z def test_silu_mul_quant(
2025-05-07T20:31:57.1841985Z self,
2025-05-07T20:31:57.1842175Z T: int,
2025-05-07T20:31:57.1842373Z D: int,
2025-05-07T20:31:57.1842598Z scale_ub: Optional[float],
2025-05-07T20:31:57.1842866Z contiguous: bool,
2025-05-07T20:31:57.1843107Z compiled: bool,
2025-05-07T20:31:57.1843339Z ) -> None:
2025-05-07T20:31:57.1843553Z torch.manual_seed(2025)
2025-05-07T20:31:57.1843798Z
2025-05-07T20:31:57.1844079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:57.1844422Z
2025-05-07T20:31:57.1844628Z x_sign = torch.sign(x)
2025-05-07T20:31:57.1844930Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:57.1845246Z x = x_sign * x_clamp
2025-05-07T20:31:57.1845490Z x0 = x[:, :D]
2025-05-07T20:31:57.1845715Z x1 = x[:, D:]
2025-05-07T20:31:57.1845932Z
2025-05-07T20:31:57.1846122Z if contiguous:
2025-05-07T20:31:57.1846364Z x0 = x0.contiguous()
2025-05-07T20:31:57.1846631Z x1 = x1.contiguous()
2025-05-07T20:31:57.1846871Z
2025-05-07T20:31:57.1847071Z if scale_ub is not None:
2025-05-07T20:31:57.1847352Z scale_ub_tensor = torch.tensor(
2025-05-07T20:31:57.1847688Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:57.1848011Z )
2025-05-07T20:31:57.1848215Z else:
2025-05-07T20:31:57.1848428Z scale_ub_tensor = None
2025-05-07T20:31:57.1848688Z
2025-05-07T20:31:57.1848938Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:57.1849256Z op = silu_mul_quant
2025-05-07T20:31:57.1849542Z if compiled:
2025-05-07T20:31:57.1849844Z op = torch.compile(op)
2025-05-07T20:31:57.1850148Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.1850428Z
2025-05-07T20:31:57.1850634Z y_fp8, y_scale = fn()
2025-05-07T20:31:57.1850932Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:57.1851225Z
2025-05-07T20:31:57.1851471Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:57.1851813Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:57.1852110Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:57.1852439Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:57.1852807Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:57.1853116Z
2025-05-07T20:31:57.1853327Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:57.1853651Z
2025-05-07T20:31:57.1853759Z moe/activation_test.py:126:
2025-05-07T20:31:57.1854136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1854477Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:57.1854809Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:57.1855598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:57.1856348Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:57.1856902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:57.1857589Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.1858278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:57.1858999Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:57.1859763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in
2025-05-07T20:31:57.1860588Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:57.1861316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:57.1861946Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:57.1862545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:57.1863064Z fn()
2025-05-07T20:31:57.1863573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:57.1864160Z self.fn.run(
2025-05-07T20:31:57.1864632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.1865166Z kernel = self.compile(
2025-05-07T20:31:57.1865703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.1866355Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.1866756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1866982Z
2025-05-07T20:31:57.1867196Z self =
2025-05-07T20:31:57.1868268Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:57.1869641Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f07bdf05ab0>} 2025-05-07T20:31:57.1870983Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.1872002Z context = 2025-05-07T20:31:57.1872289Z 2025-05-07T20:31:57.1872455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.1872975Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.1873443Z module_map=module_map) 2025-05-07T20:31:57.1873814Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.1874166Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:57.1874440Z E ^ 2025-05-07T20:31:57.1874906Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.1875439Z 2025-05-07T20:31:57.1875924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.1876450Z 2025-05-07T20:31:57.1876555Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.1876972Z self=, 2025-05-07T20:31:57.1877374Z T=16384, 2025-05-07T20:31:57.1877569Z D=7168, 2025-05-07T20:31:57.1877774Z scale_ub=1200.0, 2025-05-07T20:31:57.1878006Z contiguous=False, 2025-05-07T20:31:57.1878232Z compiled=False, 2025-05-07T20:31:57.1878448Z ) 2025-05-07T20:31:57.7398775Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:57.7399995Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:57.7401710Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:57.7403510Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:57.7404869Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:57.7406244Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.7407551Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:57.7408916Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.7410315Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:57.7411540Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 
2025-05-07T20:31:57.7412752Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:57.7413957Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:57.7415149Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:57.7416158Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:57.7417355Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:57.7419086Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:57.7420294Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:57.7421324Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:57.7422492Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:57.7423840Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:57.7424904Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.7425806Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.7426542Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:57.7427547Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.9501831Z self = 2025-05-07T20:31:58.9502466Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:58.9502881Z 2025-05-07T20:31:58.9503009Z @given( 2025-05-07T20:31:58.9503330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.9503803Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.9504226Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.9504679Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.9505102Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.9505405Z ) 2025-05-07T20:31:58.9505759Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.9506194Z def test_silu_mul_quant( 2025-05-07T20:31:58.9506445Z self, 2025-05-07T20:31:58.9506645Z T: int, 2025-05-07T20:31:58.9506839Z D: int, 2025-05-07T20:31:58.9507063Z scale_ub: Optional[float], 2025-05-07T20:31:58.9507336Z contiguous: bool, 2025-05-07T20:31:58.9507575Z compiled: bool, 2025-05-07T20:31:58.9507803Z ) -> None: 2025-05-07T20:31:58.9508023Z torch.manual_seed(2025) 2025-05-07T20:31:58.9508263Z 2025-05-07T20:31:58.9508543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.9509247Z 2025-05-07T20:31:58.9509439Z x_sign = torch.sign(x) 2025-05-07T20:31:58.9509958Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.9510280Z x = x_sign * x_clamp 2025-05-07T20:31:58.9510524Z x0 = x[:, :D] 2025-05-07T20:31:58.9510740Z x1 = x[:, D:] 2025-05-07T20:31:58.9510956Z 2025-05-07T20:31:58.9511146Z if contiguous: 2025-05-07T20:31:58.9511378Z x0 = x0.contiguous() 2025-05-07T20:31:58.9511639Z x1 = x1.contiguous() 2025-05-07T20:31:58.9511881Z 2025-05-07T20:31:58.9512072Z if scale_ub is not None: 2025-05-07T20:31:58.9512349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.9512690Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.9512992Z ) 2025-05-07T20:31:58.9513188Z else: 2025-05-07T20:31:58.9513402Z scale_ub_tensor = None
2025-05-07T20:31:58.9513650Z 2025-05-07T20:31:58.9513893Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.9514207Z op = silu_mul_quant 2025-05-07T20:31:58.9514462Z if compiled: 2025-05-07T20:31:58.9514714Z op = torch.compile(op) 2025-05-07T20:31:58.9515013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9515288Z 2025-05-07T20:31:58.9515476Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.9515646Z 2025-05-07T20:31:58.9515749Z moe/activation_test.py:117: 2025-05-07T20:31:58.9516047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9516375Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.9516660Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9517352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.9518036Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.9518581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.9519264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.9519928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.9520504Z kernel = self.compile( 2025-05-07T20:31:58.9521045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.9521720Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.9522107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9522340Z 2025-05-07T20:31:58.9532886Z self = 2025-05-07T20:31:58.9534021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.9535447Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bdf05870>} 2025-05-07T20:31:58.9536794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.9537822Z context = 2025-05-07T20:31:58.9538126Z 2025-05-07T20:31:58.9538304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.9538838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.9539343Z module_map=module_map) 2025-05-07T20:31:58.9540017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.9540388Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.9540744Z E ^ 2025-05-07T20:31:58.9541235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.9541689Z 2025-05-07T20:31:58.9542116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.9542633Z 2025-05-07T20:31:58.9542746Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.9543183Z self=, 2025-05-07T20:31:58.9543595Z T=1, 2025-05-07T20:31:58.9543789Z D=7168, 2025-05-07T20:31:58.9544001Z scale_ub=None, 2025-05-07T20:31:58.9544232Z contiguous=True, 2025-05-07T20:31:58.9544463Z compiled=True, 2025-05-07T20:31:58.9544685Z ) 2025-05-07T20:31:58.9545018Z self = 2025-05-07T20:31:58.9545525Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:58.9545789Z 2025-05-07T20:31:58.9545873Z @given( 2025-05-07T20:31:58.9546119Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.9546444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.9546756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.9547094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.9547435Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.9547734Z ) 2025-05-07T20:31:58.9548097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.9548553Z def test_silu_mul_quant( 2025-05-07T20:31:58.9548803Z self, 2025-05-07T20:31:58.9549013Z T: int, 2025-05-07T20:31:58.9549225Z D: int, 2025-05-07T20:31:58.9549459Z scale_ub: Optional[float], 2025-05-07T20:31:58.9549739Z contiguous: bool, 2025-05-07T20:31:58.9550001Z compiled: bool, 2025-05-07T20:31:58.9550286Z ) -> None: 2025-05-07T20:31:58.9550509Z torch.manual_seed(2025) 2025-05-07T20:31:58.9550764Z 2025-05-07T20:31:58.9551050Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.9551395Z 2025-05-07T20:31:58.9551603Z x_sign = torch.sign(x) 2025-05-07T20:31:58.9551905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.9552221Z x = x_sign * x_clamp 2025-05-07T20:31:58.9552477Z x0 = x[:, :D] 2025-05-07T20:31:58.9552711Z x1 = x[:, D:] 2025-05-07T20:31:58.9552927Z 2025-05-07T20:31:58.9553128Z if contiguous: 2025-05-07T20:31:58.9553377Z x0 = x0.contiguous() 2025-05-07T20:31:58.9553645Z x1 = x1.contiguous() 2025-05-07T20:31:58.9553898Z 2025-05-07T20:31:58.9554109Z if scale_ub is not None: 2025-05-07T20:31:58.9554399Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.9554754Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.9555082Z ) 2025-05-07T20:31:58.9555292Z else: 2025-05-07T20:31:58.9555511Z scale_ub_tensor = None 2025-05-07T20:31:58.9555779Z 2025-05-07T20:31:58.9556030Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.9556355Z op = silu_mul_quant 2025-05-07T20:31:58.9556621Z if compiled: 2025-05-07T20:31:58.9556884Z op = torch.compile(op) 2025-05-07T20:31:58.9557189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9557476Z 2025-05-07T20:31:58.9557688Z y_fp8, y_scale = fn() 2025-05-07T20:31:58.9557980Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:58.9558285Z 2025-05-07T20:31:58.9558536Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.9558940Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:58.9559391Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:58.9559789Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:58.9560160Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.9560479Z 2025-05-07T20:31:58.9560683Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:58.9560884Z 2025-05-07T20:31:58.9560988Z moe/activation_test.py:126: 2025-05-07T20:31:58.9561293Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9561637Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:58.9561962Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.9562760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:58.9563543Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:58.9564108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.9564831Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.9565548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:58.9566302Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.9567082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:58.9567861Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.9568622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:58.9569295Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:58.9569917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:58.9570462Z fn() 2025-05-07T20:31:58.9570999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:58.9571600Z self.fn.run( 2025-05-07T20:31:58.9572086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.9572639Z kernel = self.compile( 2025-05-07T20:31:58.9573205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.9573883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.9574299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9574536Z 2025-05-07T20:31:58.9574761Z self = 2025-05-07T20:31:58.9575901Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.9577330Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f07bde8d870>} 2025-05-07T20:31:58.9578728Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.9579869Z context = 2025-05-07T20:31:58.9580201Z 2025-05-07T20:31:58.9580404Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.9580944Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.9581519Z module_map=module_map) 2025-05-07T20:31:58.9581974Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.9582351Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:58.9582624Z E ^ 2025-05-07T20:31:58.9583109Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.9583575Z 2025-05-07T20:31:58.9584016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.9584552Z 2025-05-07T20:31:58.9584669Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.9585096Z self=, 2025-05-07T20:31:58.9585520Z T=4096, 2025-05-07T20:31:58.9585719Z D=5120, 2025-05-07T20:31:58.9585917Z scale_ub=None, 2025-05-07T20:31:58.9586145Z contiguous=False, 2025-05-07T20:31:58.9586390Z compiled=False, 2025-05-07T20:31:58.9586601Z ) 2025-05-07T20:31:59.5465970Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.5467054Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:31:59.5468395Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.5469823Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.5471195Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.5472580Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.5473878Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.5475255Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.5476666Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.5477909Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 
2025-05-07T20:31:59.5479122Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.5480350Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:31:59.5481406Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:59.5482420Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:31:59.5484124Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.5485397Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.5486503Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:59.5487546Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:31:59.5488715Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.5490409Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.5491478Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.5492376Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.5493106Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:31:59.5494104Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3147071Z self = 2025-05-07T20:32:01.3147879Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3148278Z 2025-05-07T20:32:01.3148389Z @given( 2025-05-07T20:32:01.3148717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3149129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3149537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3149966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3150297Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3150589Z ) 2025-05-07T20:32:01.3151005Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3151454Z def test_silu_mul_quant( 2025-05-07T20:32:01.3151694Z self, 2025-05-07T20:32:01.3151894Z T: int, 2025-05-07T20:32:01.3152092Z D: int, 2025-05-07T20:32:01.3152306Z scale_ub: Optional[float], 2025-05-07T20:32:01.3152960Z contiguous: bool, 2025-05-07T20:32:01.3153205Z compiled: bool, 2025-05-07T20:32:01.3153571Z ) -> None: 2025-05-07T20:32:01.3153795Z torch.manual_seed(2025) 2025-05-07T20:32:01.3154040Z 2025-05-07T20:32:01.3154311Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3154661Z 2025-05-07T20:32:01.3154860Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3155147Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3155460Z x = x_sign * x_clamp 2025-05-07T20:32:01.3155709Z x0 = x[:, :D] 2025-05-07T20:32:01.3155921Z x1 = x[:, D:] 2025-05-07T20:32:01.3156133Z 2025-05-07T20:32:01.3156326Z if contiguous: 2025-05-07T20:32:01.3156561Z x0 = x0.contiguous() 2025-05-07T20:32:01.3156819Z x1 = x1.contiguous() 2025-05-07T20:32:01.3157065Z 2025-05-07T20:32:01.3157263Z if scale_ub is not None: 2025-05-07T20:32:01.3157541Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3157881Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3158201Z ) 2025-05-07T20:32:01.3158392Z else: 2025-05-07T20:32:01.3158611Z scale_ub_tensor = None
2025-05-07T20:32:01.3158861Z 2025-05-07T20:32:01.3159094Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3159411Z op = silu_mul_quant 2025-05-07T20:32:01.3159666Z if compiled: 2025-05-07T20:32:01.3159920Z op = torch.compile(op) 2025-05-07T20:32:01.3160223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3160499Z 2025-05-07T20:32:01.3160693Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3160865Z 2025-05-07T20:32:01.3160968Z moe/activation_test.py:117: 2025-05-07T20:32:01.3161267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3161604Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3161893Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3162590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3163289Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3163825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3164510Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3165175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3165720Z kernel = self.compile( 2025-05-07T20:32:01.3166261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3166921Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3167325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3167550Z 2025-05-07T20:32:01.3167771Z self = 2025-05-07T20:32:01.3168850Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3170235Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bde8eb90>} 2025-05-07T20:32:01.3171575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3172597Z context = 2025-05-07T20:32:01.3172975Z 2025-05-07T20:32:01.3173144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3173737Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3174214Z module_map=module_map) 2025-05-07T20:32:01.3174583Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3174934Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3175190Z E ^ 2025-05-07T20:32:01.3175657Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3176101Z 2025-05-07T20:32:01.3176518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3177032Z 2025-05-07T20:32:01.3177135Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3177548Z self=, 2025-05-07T20:32:01.3177952Z T=4096, 2025-05-07T20:32:01.3178137Z D=7168, 2025-05-07T20:32:01.3178338Z scale_ub=None, 2025-05-07T20:32:01.3178568Z contiguous=False, 2025-05-07T20:32:01.3178792Z compiled=False, 2025-05-07T20:32:01.3179001Z ) 2025-05-07T20:32:01.3179322Z self = 2025-05-07T20:32:01.3179988Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3180286Z 2025-05-07T20:32:01.3180364Z @given( 2025-05-07T20:32:01.3180601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3180909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3181223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3181560Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3181892Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3182171Z ) 2025-05-07T20:32:01.3182536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3182986Z def test_silu_mul_quant( 2025-05-07T20:32:01.3183227Z self, 2025-05-07T20:32:01.3183430Z T: int, 2025-05-07T20:32:01.3183635Z D: int, 2025-05-07T20:32:01.3183855Z scale_ub: Optional[float], 2025-05-07T20:32:01.3184135Z contiguous: bool, 2025-05-07T20:32:01.3184382Z compiled: bool, 2025-05-07T20:32:01.3184603Z ) -> None: 2025-05-07T20:32:01.3184827Z torch.manual_seed(2025) 2025-05-07T20:32:01.3185077Z 2025-05-07T20:32:01.3185350Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3185699Z 2025-05-07T20:32:01.3185896Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3186182Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3186493Z x = x_sign * x_clamp 2025-05-07T20:32:01.3186737Z x0 = x[:, :D] 2025-05-07T20:32:01.3186965Z x1 = x[:, D:] 2025-05-07T20:32:01.3187168Z 2025-05-07T20:32:01.3187366Z if contiguous: 2025-05-07T20:32:01.3187606Z x0 = x0.contiguous() 2025-05-07T20:32:01.3187862Z x1 = x1.contiguous() 2025-05-07T20:32:01.3188106Z 2025-05-07T20:32:01.3188306Z if scale_ub is not None: 2025-05-07T20:32:01.3188576Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3188917Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3189234Z ) 2025-05-07T20:32:01.3189421Z else: 2025-05-07T20:32:01.3189638Z scale_ub_tensor = None 2025-05-07T20:32:01.3190304Z 2025-05-07T20:32:01.3190542Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3190859Z op = silu_mul_quant 2025-05-07T20:32:01.3191113Z if compiled: 2025-05-07T20:32:01.3191359Z op = torch.compile(op) 2025-05-07T20:32:01.3191658Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3192168Z 2025-05-07T20:32:01.3192363Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3192529Z 2025-05-07T20:32:01.3192738Z moe/activation_test.py:117: 2025-05-07T20:32:01.3193041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3193375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3193659Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3194350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3195044Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3195586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3196262Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3196930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3197467Z kernel = self.compile( 2025-05-07T20:32:01.3198008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3198673Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3199082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3199308Z 2025-05-07T20:32:01.3199528Z self = 2025-05-07T20:32:01.3200600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3201986Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bde8eb00>} 2025-05-07T20:32:01.3203338Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3204360Z context = 2025-05-07T20:32:01.3204646Z 2025-05-07T20:32:01.3204819Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3205332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3205798Z module_map=module_map) 2025-05-07T20:32:01.3206166Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3206515Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3206771Z E ^ 2025-05-07T20:32:01.3207235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3207692Z 2025-05-07T20:32:01.3208119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3208626Z 2025-05-07T20:32:01.3208732Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3209145Z self=, 2025-05-07T20:32:01.3209542Z T=128, 2025-05-07T20:32:01.3209726Z D=7168, 2025-05-07T20:32:01.3209922Z scale_ub=None, 2025-05-07T20:32:01.3210144Z contiguous=False, 2025-05-07T20:32:01.3210370Z compiled=True, 2025-05-07T20:32:01.3210578Z ) 2025-05-07T20:32:01.3848055Z self = 2025-05-07T20:32:01.3848787Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.3849169Z 2025-05-07T20:32:01.3849298Z @given( 2025-05-07T20:32:01.3849553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3850228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3850538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3850997Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3851337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3851631Z ) 2025-05-07T20:32:01.3851981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3852427Z def test_silu_mul_quant( 2025-05-07T20:32:01.3852677Z self, 2025-05-07T20:32:01.3852873Z T: int, 2025-05-07T20:32:01.3853077Z D: int, 2025-05-07T20:32:01.3853302Z scale_ub: Optional[float], 2025-05-07T20:32:01.3853580Z contiguous: bool, 2025-05-07T20:32:01.3853826Z compiled: bool, 2025-05-07T20:32:01.3854061Z ) -> None: 2025-05-07T20:32:01.3854286Z torch.manual_seed(2025) 2025-05-07T20:32:01.3854528Z 2025-05-07T20:32:01.3854807Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3855166Z 2025-05-07T20:32:01.3855362Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3855665Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3855981Z x = x_sign * x_clamp 2025-05-07T20:32:01.3856222Z x0 = x[:, :D] 2025-05-07T20:32:01.3856444Z x1 = x[:, D:] 2025-05-07T20:32:01.3856665Z 2025-05-07T20:32:01.3856851Z if contiguous: 2025-05-07T20:32:01.3857094Z x0 = x0.contiguous() 2025-05-07T20:32:01.3857359Z x1 = x1.contiguous() 2025-05-07T20:32:01.3857597Z 2025-05-07T20:32:01.3857800Z if scale_ub is not None: 2025-05-07T20:32:01.3858077Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3858436Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3866127Z ) 2025-05-07T20:32:01.3866348Z else: 2025-05-07T20:32:01.3866579Z scale_ub_tensor = None 2025-05-07T20:32:01.3866854Z 2025-05-07T20:32:01.3867113Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3867445Z op = silu_mul_quant 2025-05-07T20:32:01.3867720Z if compiled: 2025-05-07T20:32:01.3867977Z op = torch.compile(op) 2025-05-07T20:32:01.3868287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3868574Z 2025-05-07T20:32:01.3868774Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.3869077Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.3869378Z 2025-05-07T20:32:01.3869616Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3869958Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.3870255Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.3870572Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.3870930Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.3871243Z 2025-05-07T20:32:01.3871455Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:01.3871650Z 2025-05-07T20:32:01.3871753Z moe/activation_test.py:126: 2025-05-07T20:32:01.3872063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3872404Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.3872731Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.3873528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.3874283Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.3874837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3875519Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3876220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.3877077Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.3877909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.3878666Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.3879400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.3880049Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.3880658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.3881228Z fn() 2025-05-07T20:32:01.3881742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.3882341Z self.fn.run( 2025-05-07T20:32:01.3882815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3883357Z kernel = self.compile( 2025-05-07T20:32:01.3883915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3884579Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3884975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3885210Z 2025-05-07T20:32:01.3885426Z self = 2025-05-07T20:32:01.3886512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3887900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f07bde8fac0>} 2025-05-07T20:32:01.3889253Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3890597Z context = 2025-05-07T20:32:01.3890897Z 2025-05-07T20:32:01.3891070Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3891604Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3892073Z module_map=module_map) 2025-05-07T20:32:01.3892451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3892818Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.3893085Z E ^ 2025-05-07T20:32:01.3893570Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3894026Z 2025-05-07T20:32:01.3894458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3894971Z 2025-05-07T20:32:01.3895088Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3895499Z self=, 2025-05-07T20:32:01.3895910Z T=128, 2025-05-07T20:32:01.3896112Z D=7168, 2025-05-07T20:32:01.3896306Z scale_ub=None, 2025-05-07T20:32:01.3896536Z contiguous=False, 2025-05-07T20:32:01.3896775Z compiled=False, 2025-05-07T20:32:01.3896991Z ) 2025-05-07T20:32:01.7577991Z self = 2025-05-07T20:32:01.7578787Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.7579180Z 2025-05-07T20:32:01.7579663Z @given( 2025-05-07T20:32:01.7580180Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.7580689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.7581005Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.7581343Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.7581677Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.7581970Z ) 2025-05-07T20:32:01.7582329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.7582773Z def test_silu_mul_quant( 2025-05-07T20:32:01.7583027Z self, 2025-05-07T20:32:01.7583233Z T: int, 2025-05-07T20:32:01.7583434Z D: int, 2025-05-07T20:32:01.7583663Z scale_ub: Optional[float], 2025-05-07T20:32:01.7583946Z contiguous: bool, 2025-05-07T20:32:01.7584186Z compiled: bool, 2025-05-07T20:32:01.7584425Z ) -> None: 2025-05-07T20:32:01.7584650Z torch.manual_seed(2025) 2025-05-07T20:32:01.7584913Z 2025-05-07T20:32:01.7585188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.7585547Z 2025-05-07T20:32:01.7585749Z x_sign = torch.sign(x) 2025-05-07T20:32:01.7586046Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.7586366Z x = x_sign * x_clamp 2025-05-07T20:32:01.7586616Z x0 = x[:, :D] 2025-05-07T20:32:01.7586836Z x1 = x[:, D:] 2025-05-07T20:32:01.7587053Z 2025-05-07T20:32:01.7587248Z if contiguous: 2025-05-07T20:32:01.7587486Z x0 = x0.contiguous() 2025-05-07T20:32:01.7587756Z x1 = x1.contiguous() 2025-05-07T20:32:01.7588001Z 2025-05-07T20:32:01.7588198Z if scale_ub is not None: 2025-05-07T20:32:01.7588485Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.7588833Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.7589145Z ) 2025-05-07T20:32:01.7589361Z else: 2025-05-07T20:32:01.7589592Z scale_ub_tensor = None 2025-05-07T20:32:01.7590094Z 2025-05-07T20:32:01.7590337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.7590662Z op = silu_mul_quant 2025-05-07T20:32:01.7590925Z if compiled: 
2025-05-07T20:32:01.7591182Z op = torch.compile(op) 2025-05-07T20:32:01.7591488Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7591776Z 2025-05-07T20:32:01.7591973Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.7592152Z 2025-05-07T20:32:01.7592256Z moe/activation_test.py:117: 2025-05-07T20:32:01.7592565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7592900Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.7593189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7593889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.7594596Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.7595141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.7595839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.7596509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.7597050Z kernel = self.compile( 2025-05-07T20:32:01.7597604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.7598271Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.7598675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7598905Z 2025-05-07T20:32:01.7599119Z self = 2025-05-07T20:32:01.7600444Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.7601841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bda38f70>} 2025-05-07T20:32:01.7603194Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.7604227Z context = 2025-05-07T20:32:01.7604519Z 2025-05-07T20:32:01.7604689Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.7605219Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.7605699Z module_map=module_map) 2025-05-07T20:32:01.7606075Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.7606436Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.7606707Z E ^ 2025-05-07T20:32:01.7607185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.7607636Z 2025-05-07T20:32:01.7608055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.7608579Z 2025-05-07T20:32:01.7608689Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.7609109Z self=, 2025-05-07T20:32:01.7609517Z T=4096, 2025-05-07T20:32:01.7609713Z D=5120, 2025-05-07T20:32:01.7609916Z scale_ub=1200.0, 2025-05-07T20:32:01.7610154Z contiguous=True, 2025-05-07T20:32:01.7610390Z compiled=False, 2025-05-07T20:32:01.7610612Z ) 2025-05-07T20:32:01.7610948Z self = 2025-05-07T20:32:01.7611453Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.7611734Z 2025-05-07T20:32:01.7611820Z @given( 2025-05-07T20:32:01.7612066Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.7612386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.7612708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.7613046Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.7613385Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.7613674Z ) 2025-05-07T20:32:01.7614034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.7614485Z def test_silu_mul_quant( 2025-05-07T20:32:01.7614733Z self, 2025-05-07T20:32:01.7614942Z T: int, 2025-05-07T20:32:01.7615151Z D: int, 2025-05-07T20:32:01.7615377Z scale_ub: Optional[float], 2025-05-07T20:32:01.7615667Z contiguous: bool, 2025-05-07T20:32:01.7615916Z compiled: bool, 2025-05-07T20:32:01.7616142Z ) -> None: 2025-05-07T20:32:01.7616369Z torch.manual_seed(2025) 2025-05-07T20:32:01.7616621Z 2025-05-07T20:32:01.7616898Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.7617246Z 2025-05-07T20:32:01.7617450Z x_sign = torch.sign(x) 2025-05-07T20:32:01.7617746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.7618067Z x = x_sign * x_clamp 2025-05-07T20:32:01.7618319Z x0 = x[:, :D] 2025-05-07T20:32:01.7618546Z x1 = x[:, D:] 2025-05-07T20:32:01.7618757Z 2025-05-07T20:32:01.7618959Z if contiguous: 2025-05-07T20:32:01.7619206Z x0 = x0.contiguous() 2025-05-07T20:32:01.7619474Z x1 = x1.contiguous() 2025-05-07T20:32:01.7619911Z 2025-05-07T20:32:01.7620116Z if scale_ub is not None: 2025-05-07T20:32:01.7620500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.7620845Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.7621164Z ) 2025-05-07T20:32:01.7621365Z else: 2025-05-07T20:32:01.7621589Z scale_ub_tensor = None 2025-05-07T20:32:01.7621850Z 2025-05-07T20:32:01.7622087Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.7622412Z op = silu_mul_quant 2025-05-07T20:32:01.7622678Z if compiled: 2025-05-07T20:32:01.7622929Z op = torch.compile(op) 2025-05-07T20:32:01.7623237Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7623528Z 2025-05-07T20:32:01.7623725Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.7623905Z 2025-05-07T20:32:01.7624013Z moe/activation_test.py:117: 2025-05-07T20:32:01.7624320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7624671Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.7624963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7625664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.7626370Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.7626913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.7627606Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.7628283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.7628827Z kernel = self.compile( 2025-05-07T20:32:01.7629373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.7630044Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.7630453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7630685Z 2025-05-07T20:32:01.7630911Z self = 2025-05-07T20:32:01.7632043Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.7633421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bda39510>} 2025-05-07T20:32:01.7634775Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.7635813Z context = 2025-05-07T20:32:01.7636111Z 2025-05-07T20:32:01.7636284Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.7636815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.7637297Z module_map=module_map) 2025-05-07T20:32:01.7637675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.7638039Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.7638311Z E ^ 2025-05-07T20:32:01.7638786Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.7639236Z 2025-05-07T20:32:01.7639663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.7640273Z 2025-05-07T20:32:01.7640381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.7640874Z self=, 2025-05-07T20:32:01.7641290Z T=1, 2025-05-07T20:32:01.7641477Z D=5120, 2025-05-07T20:32:01.7641684Z scale_ub=None, 2025-05-07T20:32:01.7641912Z contiguous=True, 2025-05-07T20:32:01.7642137Z compiled=True, 2025-05-07T20:32:01.7642347Z ) 2025-05-07T20:32:02.2212270Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:02.2213351Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:02.2214694Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:02.2216152Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:02.2217520Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:02.2218888Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.2220289Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:02.2221666Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.2223063Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:02.2224297Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:02.2225506Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:02.2226719Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:02.2227754Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:02.2228760Z W0507 20:32:02.217000 87525 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:02.2229960Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:02.2231268Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:02.2232378Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:02.2233906Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:02.2235078Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:02.2236421Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:02.2237474Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.2238369Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.2239113Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:02.2240126Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.3829515Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:02.3831004Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:02.3832471Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:02.3833948Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:02.3835317Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:02.3836700Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.3837993Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:02.3839356Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.3840755Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:02.3841988Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:02.3843208Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:02.3844404Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:02.3845912Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:02.3846925Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:02.3848132Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:02.3849400Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:02.3850503Z W0507 
20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:02.3851556Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:02.3852717Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:02.3854064Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:02.3855114Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.3856015Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.3856743Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:02.3857769Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.8226778Z self = 2025-05-07T20:32:02.8227437Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:02.8227697Z 2025-05-07T20:32:02.8227786Z @given( 2025-05-07T20:32:02.8228018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.8228338Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.8228652Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.8228990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.8229316Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.8229613Z ) 2025-05-07T20:32:02.8230002Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.8230452Z def test_silu_mul_quant( 2025-05-07T20:32:02.8230698Z self, 2025-05-07T20:32:02.8230900Z T: int, 2025-05-07T20:32:02.8231096Z D: int, 2025-05-07T20:32:02.8231348Z scale_ub: Optional[float], 2025-05-07T20:32:02.8231647Z contiguous: bool, 2025-05-07T20:32:02.8231885Z compiled: bool, 2025-05-07T20:32:02.8232121Z ) -> None: 2025-05-07T20:32:02.8232342Z torch.manual_seed(2025) 2025-05-07T20:32:02.8232580Z 2025-05-07T20:32:02.8232862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.8233206Z 2025-05-07T20:32:02.8233409Z x_sign = torch.sign(x) 2025-05-07T20:32:02.8233702Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.8234022Z x = x_sign * x_clamp 2025-05-07T20:32:02.8234276Z x0 = x[:, :D] 2025-05-07T20:32:02.8234903Z x1 = x[:, D:] 2025-05-07T20:32:02.8235117Z 2025-05-07T20:32:02.8235309Z if contiguous: 2025-05-07T20:32:02.8235680Z x0 = x0.contiguous() 2025-05-07T20:32:02.8235949Z x1 = x1.contiguous() 2025-05-07T20:32:02.8236193Z 2025-05-07T20:32:02.8236389Z if scale_ub is not None: 2025-05-07T20:32:02.8236673Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.8237014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.8237320Z ) 2025-05-07T20:32:02.8237520Z else: 2025-05-07T20:32:02.8237743Z scale_ub_tensor = None 
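
The CompilationError repeated above has a single root cause: Triton's fp8e4nv dtype corresponds to torch.float8_e4m3fn, which only lowers to native FP8 on GPUs of compute capability 8.9 or newer (Ada/Hopper), while the A10G on this g5.4xlarge runner is sm_86 and therefore exposes only fp8e4b15 and fp8e5, exactly as the ValueError states. A minimal sketch of a capability guard such a test could use to skip cleanly on pre-sm_89 hardware; the helper name and the (8, 9) threshold are illustrative assumptions, not part of the FBGEMM test suite:

    import unittest
    import torch

    def _device_supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= 8.9;
        # the A10G on this runner reports sm_86, so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage: decorate fp8 tests so they skip on older GPUs
    # instead of failing at Triton compile time.
    requires_fp8 = unittest.skipUnless(
        _device_supports_fp8e4nv(), "fp8e4nv unsupported on this architecture"
    )
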
2025-05-07T20:32:02.8237990Z 2025-05-07T20:32:02.8238233Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.8238551Z op = silu_mul_quant 2025-05-07T20:32:02.8238819Z if compiled: 2025-05-07T20:32:02.8239073Z op = torch.compile(op) 2025-05-07T20:32:02.8239375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.8239666Z 2025-05-07T20:32:02.8239861Z y_fp8, y_scale = fn() 2025-05-07T20:32:02.8240163Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:02.8240461Z 2025-05-07T20:32:02.8240699Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.8241045Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:02.8241347Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:02.8241662Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:02.8242029Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:02.8242348Z 2025-05-07T20:32:02.8242549Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:02.8242754Z 2025-05-07T20:32:02.8242858Z moe/activation_test.py:126: 2025-05-07T20:32:02.8243162Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.8243498Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:02.8243827Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:02.8244623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:02.8245375Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:02.8245932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.8246611Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.8247300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:02.8248024Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:02.8248777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:02.8249523Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:02.8250261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:02.8250905Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:02.8251557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:02.8252078Z fn() 2025-05-07T20:32:02.8252598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:02.8253178Z self.fn.run( 2025-05-07T20:32:02.8253641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.8254176Z kernel = self.compile( 2025-05-07T20:32:02.8254716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.8255465Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.8255939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.8256174Z 2025-05-07T20:32:02.8256384Z self = 2025-05-07T20:32:02.8257469Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.8258864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bdf070a0>} 2025-05-07T20:32:02.8260315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.8261374Z context = 2025-05-07T20:32:02.8261693Z 2025-05-07T20:32:02.8261868Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.8262410Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.8269776Z module_map=module_map) 2025-05-07T20:32:02.8270185Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.8270562Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:02.8270844Z E ^ 2025-05-07T20:32:02.8271330Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.8271789Z 2025-05-07T20:32:02.8272218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.8272746Z 2025-05-07T20:32:02.8272864Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.8273288Z self=, 2025-05-07T20:32:02.8273700Z T=2048, 2025-05-07T20:32:02.8273908Z D=5120, 2025-05-07T20:32:02.8274111Z scale_ub=None, 2025-05-07T20:32:02.8274341Z contiguous=True, 2025-05-07T20:32:02.8274577Z compiled=True, 2025-05-07T20:32:02.8274793Z ) 2025-05-07T20:32:03.2400274Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.2401381Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:03.2402737Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.2404203Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.2405580Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.2406961Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.2408254Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.2410117Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.2411528Z W0507 20:32:03.236000 87525 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.2412751Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.2413966Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.2415163Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:03.2416205Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:03.2417217Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:03.2418416Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.2419691Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.2420886Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:03.2421931Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:03.2423093Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.2424437Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.2425489Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.2426402Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.2427146Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:03.2428156Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.4004640Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.4006021Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:03.4007363Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.4009320Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.4010695Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.4012061Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.4013379Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.4015692Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.4017116Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.4018356Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.4019558Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.4020836Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:03.4021888Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:03.4022912Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:03.4024122Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.4025394Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.4026504Z W0507 
20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:03.4027547Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:03.4028720Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.4030083Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.4031133Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.4032043Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.4032786Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:03.4033970Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.8386108Z self = 2025-05-07T20:32:03.8386675Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:03.8386954Z 2025-05-07T20:32:03.8387039Z @given( 2025-05-07T20:32:03.8387288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.8387613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.8387924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.8388266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.8388607Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.8388902Z ) 2025-05-07T20:32:03.8389296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.8389762Z def test_silu_mul_quant( 2025-05-07T20:32:03.8390263Z self, 2025-05-07T20:32:03.8390478Z T: int, 2025-05-07T20:32:03.8390689Z D: int, 2025-05-07T20:32:03.8390923Z scale_ub: Optional[float], 2025-05-07T20:32:03.8391199Z contiguous: bool, 2025-05-07T20:32:03.8391459Z compiled: bool, 2025-05-07T20:32:03.8391702Z ) -> None: 2025-05-07T20:32:03.8391922Z torch.manual_seed(2025) 2025-05-07T20:32:03.8392180Z 2025-05-07T20:32:03.8392466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.8392813Z 2025-05-07T20:32:03.8393019Z x_sign = torch.sign(x) 2025-05-07T20:32:03.8393320Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.8393630Z x = x_sign * x_clamp 2025-05-07T20:32:03.8393883Z x0 = x[:, :D] 2025-05-07T20:32:03.8394120Z x1 = x[:, D:] 2025-05-07T20:32:03.8394331Z 2025-05-07T20:32:03.8394527Z if contiguous: 2025-05-07T20:32:03.8394776Z x0 = x0.contiguous() 2025-05-07T20:32:03.8395040Z x1 = x1.contiguous() 2025-05-07T20:32:03.8395291Z 2025-05-07T20:32:03.8395498Z if scale_ub is not None: 2025-05-07T20:32:03.8395776Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.8396123Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.8396438Z ) 2025-05-07T20:32:03.8396642Z else: 2025-05-07T20:32:03.8396860Z scale_ub_tensor = None 
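
For context on the failing reference path: ref_fn computes y = x0 * sigmoid(x0) * x1 in fp32 and hands it to triton_quantize_fp8_row, whose kernel then fails to compile for the same reason as the main op. A rough pure-PyTorch sketch of what rowwise FP8 quantization computes, assuming a per-row max scale optionally clamped by scale_ub; the eps value and exact clamping order inside _kernel_quantize_fp8_row are assumptions here, not its actual implementation:

    import torch

    def rowwise_fp8_quant_sketch(y, scale_ub=None, eps=1e-12):
        # Per-row absolute maximum determines each row's dynamic range.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        scale = torch.clamp(row_max, min=eps) / fp8_max  # dequant scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize with y_fp8.to(torch.float32) * scale[:, None], which is
        # exactly what the test does right after fn() returns.
        return y_fp8, scale
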
2025-05-07T20:32:03.8397119Z 2025-05-07T20:32:03.8397359Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.8397675Z op = silu_mul_quant 2025-05-07T20:32:03.8397936Z if compiled: 2025-05-07T20:32:03.8398195Z op = torch.compile(op) 2025-05-07T20:32:03.8398494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.8398781Z 2025-05-07T20:32:03.8398989Z y_fp8, y_scale = fn() 2025-05-07T20:32:03.8399283Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:03.8399582Z 2025-05-07T20:32:03.8399830Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.8400168Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:03.8400477Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:03.8400803Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:03.8401171Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.8401492Z 2025-05-07T20:32:03.8401743Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:03.8401947Z 2025-05-07T20:32:03.8402060Z moe/activation_test.py:126: 2025-05-07T20:32:03.8402368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8402720Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:03.8403436Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.8404384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:03.8405161Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:03.8405718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.8406416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.8407107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:03.8407835Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:03.8408596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:03.8409357Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:03.8410098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:03.8410744Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:03.8411352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:03.8411877Z fn() 2025-05-07T20:32:03.8412390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:03.8412975Z self.fn.run( 2025-05-07T20:32:03.8413454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.8413983Z kernel = self.compile( 2025-05-07T20:32:03.8414530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.8415196Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.8415603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8415834Z 2025-05-07T20:32:03.8416050Z self = 2025-05-07T20:32:03.8417129Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.8418533Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bd48a7a0>} 2025-05-07T20:32:03.8419960Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.8420984Z context = 2025-05-07T20:32:03.8421284Z 2025-05-07T20:32:03.8421456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.8421984Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.8422455Z module_map=module_map) 2025-05-07T20:32:03.8422824Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.8423187Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:03.8424980Z E ^ 2025-05-07T20:32:03.8425446Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.8425901Z 2025-05-07T20:32:03.8426316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.8426934Z 2025-05-07T20:32:03.8427043Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.8427534Z self=, 2025-05-07T20:32:03.8427937Z T=128, 2025-05-07T20:32:03.8428139Z D=5120, 2025-05-07T20:32:03.8428347Z scale_ub=None, 2025-05-07T20:32:03.8428567Z contiguous=True, 2025-05-07T20:32:03.8428805Z compiled=True, 2025-05-07T20:32:03.8429025Z ) 2025-05-07T20:32:04.3109045Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.3110145Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:04.3111478Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.3112952Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.3114316Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.3115688Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.3116988Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.3118358Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.3119770Z W0507 20:32:04.307000 87525 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.3121003Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:04.3122256Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.3123464Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:04.3124500Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:04.3125509Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:04.3126813Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.3128097Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.3129212Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:04.3130728Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:04.3131906Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.3133270Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.3134331Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.3135244Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.3135996Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:04.3137020Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.4740773Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.4742283Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:04.4743606Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.4745059Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.4746427Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.4747790Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.4749431Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.4751164Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.4752958Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.4754521Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:04.4756053Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.4757558Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:04.4759025Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:04.4760047Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:04.4761258Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.4762518Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.4763625Z W0507 
20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:04.4764669Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:04.4765838Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.4767193Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.4768240Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.4769146Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.4769884Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:04.4770904Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2250208Z self = 2025-05-07T20:32:05.2250786Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.2251056Z 2025-05-07T20:32:05.2251137Z @given( 2025-05-07T20:32:05.2251372Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.2251678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.2251992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.2252328Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.2252653Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.2252945Z ) 2025-05-07T20:32:05.2253327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.2253774Z def test_silu_mul_quant( 2025-05-07T20:32:05.2254021Z self, 2025-05-07T20:32:05.2254218Z T: int, 2025-05-07T20:32:05.2254421Z D: int, 2025-05-07T20:32:05.2254639Z scale_ub: Optional[float], 2025-05-07T20:32:05.2254922Z contiguous: bool, 2025-05-07T20:32:05.2255165Z compiled: bool, 2025-05-07T20:32:05.2255393Z ) -> None: 2025-05-07T20:32:05.2255613Z torch.manual_seed(2025) 2025-05-07T20:32:05.2255855Z 2025-05-07T20:32:05.2256127Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.2256465Z 2025-05-07T20:32:05.2256665Z x_sign = torch.sign(x) 2025-05-07T20:32:05.2256980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.2257285Z x = x_sign * x_clamp 2025-05-07T20:32:05.2257531Z x0 = x[:, :D] 2025-05-07T20:32:05.2258190Z x1 = x[:, D:] 2025-05-07T20:32:05.2258400Z 2025-05-07T20:32:05.2258599Z if contiguous: 2025-05-07T20:32:05.2258988Z x0 = x0.contiguous() 2025-05-07T20:32:05.2259248Z x1 = x1.contiguous() 2025-05-07T20:32:05.2259494Z 2025-05-07T20:32:05.2259693Z if scale_ub is not None: 2025-05-07T20:32:05.2260083Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.2260423Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.2260734Z ) 2025-05-07T20:32:05.2260934Z else: 2025-05-07T20:32:05.2261148Z scale_ub_tensor = None 
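
The surrounding W0507 identify_mutated_tensors warnings are a side effect of torch.compile: before wrapping a user-defined Triton kernel, Dynamo generates TTIR for it to determine which arguments are mutated, and when that generation raises, as it does on every attempt here, it logs the traceback and conservatively assumes every input is mutated. The architecture limit reproduces without the test harness; a standalone sketch, with the kernel body and launch parameters invented for illustration rather than taken from FBGEMM:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8_demo(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm_86 this cast is what ultimately raises:
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, dtype=torch.float8_e4m3fn, device="cuda")
    _cast_fp8_demo[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)
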
2025-05-07T20:32:05.2261403Z 2025-05-07T20:32:05.2261642Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.2261957Z op = silu_mul_quant 2025-05-07T20:32:05.2262217Z if compiled: 2025-05-07T20:32:05.2262471Z op = torch.compile(op) 2025-05-07T20:32:05.2262770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.2263056Z 2025-05-07T20:32:05.2263253Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.2263546Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.2263839Z 2025-05-07T20:32:05.2264087Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.2264416Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.2264717Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.2265042Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.2265401Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.2265708Z 2025-05-07T20:32:05.2265911Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.2266105Z 2025-05-07T20:32:05.2266216Z moe/activation_test.py:126: 2025-05-07T20:32:05.2266510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.2266844Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.2267181Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.2267973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.2268722Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.2269274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.2269955Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.2270639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.2271360Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.2272163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.2272907Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.2273637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.2274284Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.2274882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.2275402Z fn() 2025-05-07T20:32:05.2275911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.2276488Z self.fn.run( 2025-05-07T20:32:05.2276957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.2277485Z kernel = self.compile( 2025-05-07T20:32:05.2278025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.2278792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.2279262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.2279494Z 2025-05-07T20:32:05.2279704Z self = 2025-05-07T20:32:05.2280786Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.2282179Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bd48bc70>} 2025-05-07T20:32:05.2283508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.2284530Z context = 2025-05-07T20:32:05.2284823Z 2025-05-07T20:32:05.2284992Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.2285520Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.2285982Z module_map=module_map) 2025-05-07T20:32:05.2286347Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.2286705Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.2286975Z E ^ 2025-05-07T20:32:05.2287436Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2287888Z 2025-05-07T20:32:05.2288310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.2288836Z 2025-05-07T20:32:05.2288942Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.2289359Z self=, 2025-05-07T20:32:05.2289751Z T=4096, 2025-05-07T20:32:05.2290261Z D=5120, 2025-05-07T20:32:05.2290459Z scale_ub=None, 2025-05-07T20:32:05.2290670Z contiguous=True, 2025-05-07T20:32:05.2290892Z compiled=True, 2025-05-07T20:32:05.2291104Z ) 2025-05-07T20:32:05.6997040Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.6998353Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:05.6999691Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.7001170Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.7002551Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.7003934Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7005235Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.7007076Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7008499Z W0507 20:32:05.696000 87525 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.7009748Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:05.7010952Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.7012151Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:05.7013197Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:05.7014220Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:05.7015420Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.7016687Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.7017793Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:05.7018833Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:05.7020070Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.7021405Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.7022458Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7023360Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.7024097Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:05.7025109Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.8621313Z W0507 20:32:05.858000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[The identify_mutated_tensors traceback at 20:32:05.858 repeats the one above verbatim, ending in the same CompilationError at the definition of _fbgemm_silu_mul_quant with the same fp8e4nv ValueError; it is elided here.]
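The ValueError above is the root cause for every failure in this job: 'fp8e4nv' is Triton's name for torch.float8_e4m3fn, which Triton's NVIDIA backend compiles only for compute capability 8.9 and newer (Ada/Hopper), while the linux.g5.4xlarge runner carries an A10G that reports (8, 6). A minimal guard, as a sketch (the helper name supports_fp8e4nv and the (8, 9) cutoff are our annotation, inferred from the error message, not taken from FBGEMM):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) is accepted by Triton's CUDA backend only
        # on sm_89-class GPUs or newer; the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)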
2025-05-07T20:32:06.4447407Z self = <...>
2025-05-07T20:32:06.4447955Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:06.4448242Z
2025-05-07T20:32:06.4448330Z     @given(
2025-05-07T20:32:06.4448570Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:06.4448891Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:06.4449205Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:06.4449536Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:06.4449878Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:06.4450171Z     )
2025-05-07T20:32:06.4450555Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:06.4451016Z     def test_silu_mul_quant(
2025-05-07T20:32:06.4451273Z         self,
2025-05-07T20:32:06.4451480Z         T: int,
2025-05-07T20:32:06.4451682Z         D: int,
2025-05-07T20:32:06.4451915Z         scale_ub: Optional[float],
2025-05-07T20:32:06.4452198Z         contiguous: bool,
2025-05-07T20:32:06.4452442Z         compiled: bool,
2025-05-07T20:32:06.4452680Z     ) -> None:
2025-05-07T20:32:06.4452907Z         torch.manual_seed(2025)
2025-05-07T20:32:06.4453155Z
2025-05-07T20:32:06.4453438Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:06.4453790Z
2025-05-07T20:32:06.4453988Z         x_sign = torch.sign(x)
2025-05-07T20:32:06.4454293Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:06.4454612Z         x = x_sign * x_clamp
2025-05-07T20:32:06.4454856Z         x0 = x[:, :D]
2025-05-07T20:32:06.4455088Z         x1 = x[:, D:]
2025-05-07T20:32:06.4455306Z
2025-05-07T20:32:06.4455495Z         if contiguous:
2025-05-07T20:32:06.4455746Z             x0 = x0.contiguous()
2025-05-07T20:32:06.4456020Z             x1 = x1.contiguous()
2025-05-07T20:32:06.4456258Z
2025-05-07T20:32:06.4456457Z         if scale_ub is not None:
2025-05-07T20:32:06.4456737Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:06.4457078Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:06.4457391Z             )
2025-05-07T20:32:06.4457594Z         else:
2025-05-07T20:32:06.4457817Z             scale_ub_tensor = None
2025-05-07T20:32:06.4458072Z
2025-05-07T20:32:06.4458311Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:06.4458631Z             op = silu_mul_quant
2025-05-07T20:32:06.4458881Z             if compiled:
2025-05-07T20:32:06.4459137Z                 op = torch.compile(op)
2025-05-07T20:32:06.4459439Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:06.4459717Z
2025-05-07T20:32:06.4460000Z         y_fp8, y_scale = fn()
2025-05-07T20:32:06.4460299Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:06.4460588Z
2025-05-07T20:32:06.4460828Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:06.4461163Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:06.4461462Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:06.4461775Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:06.4462140Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:06.4462459Z
2025-05-07T20:32:06.4462661Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:06.4462863Z
2025-05-07T20:32:06.4462965Z moe/activation_test.py:126:
2025-05-07T20:32:06.4463267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:06.4463599Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:06.4464325Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:06.4465260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:06.4466019Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:06.4466572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:06.4467254Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:06.4467939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:06.4468651Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:06.4469402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:06.4470151Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:06.4470892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:06.4471522Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:06.4472120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:06.4472637Z     fn()
2025-05-07T20:32:06.4473139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:06.4473714Z     self.fn.run(
2025-05-07T20:32:06.4474181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:06.4474716Z     kernel = self.compile(
2025-05-07T20:32:06.4475252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:06.4475907Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.4476311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:06.4476540Z
2025-05-07T20:32:06.4476754Z self = <...>
2025-05-07T20:32:06.4477821Z options =
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:06.4479200Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f07bcfa4940>}
2025-05-07T20:32:06.4480533Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:06.4481560Z context = <...>
2025-05-07T20:32:06.4481851Z
2025-05-07T20:32:06.4482017Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:06.4482543Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.4483010Z                            module_map=module_map)
2025-05-07T20:32:06.4483379Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.4483730Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:06.4484003Z E       ^
2025-05-07T20:32:06.4484469Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:06.4484916Z
2025-05-07T20:32:06.4485338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:06.4485953Z
2025-05-07T20:32:06.4486061Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:06.4486556Z     self=<...>,
2025-05-07T20:32:06.4486969Z     T=16384,
2025-05-07T20:32:06.4487163Z     D=5120,
2025-05-07T20:32:06.4487365Z     scale_ub=None,
2025-05-07T20:32:06.4487586Z     contiguous=True,
2025-05-07T20:32:06.4487807Z     compiled=True,
2025-05-07T20:32:06.4488020Z )
2025-05-07T20:32:06.4879181Z W0507 20:32:06.486000 87525 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:06.4880429Z W0507 20:32:06.486000 87525 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:06.4881752Z W0507 20:32:06.486000 87525 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:06.4882799Z W0507 20:32:06.486000 87525 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:06.4883901Z W0507 20:32:06.486000 87525 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
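The recompile_limit warning above is separate from the compilation failures: every Hypothesis example changes T, and contiguous=False additionally changes the strides of the x0/x1 views, so torch.compile re-specializes silu_mul_quant per guard set until it hits the limit of 8 and falls back to eager. One mitigation, as a sketch assuming the module path shown in the log (torch._dynamo.mark_dynamic is real API; applying it here is our suggestion and may not remove stride-based guards):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    # Mark the token dimension dynamic so one dynamic-shape graph can serve all
    # T values Hypothesis samples, instead of one specialized graph per shape.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)
    op = torch.compile(silu_mul_quant)
    y_fp8, y_scale = op(x0, x1, None)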
2025-05-07T20:32:06.5912999Z self = <...>
2025-05-07T20:32:06.5913524Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[The test body, the traceback through ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, and the CompilationError are identical to the T=4096 failure above; elided.]
2025-05-07T20:32:06.5951559Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:06.5951969Z     self=<...>,
2025-05-07T20:32:06.5952370Z     T=1,
2025-05-07T20:32:06.5952561Z     D=5120,
2025-05-07T20:32:06.5952753Z     scale_ub=1200.0,
2025-05-07T20:32:06.5952979Z     contiguous=True,
2025-05-07T20:32:06.5953205Z     compiled=True,
2025-05-07T20:32:06.5953407Z )
2025-05-07T20:32:06.7403716Z self = <...>
2025-05-07T20:32:06.7404258Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[Same test body as above; this example fails one step earlier, inside fn() itself, at the compiled silu_mul_quant:]
2025-05-07T20:32:06.7416686Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:06.7416860Z
2025-05-07T20:32:06.7416962Z moe/activation_test.py:117:
2025-05-07T20:32:06.7417280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:06.7417619Z moe/activation_test.py:115: in fn
2025-05-07T20:32:06.7417900Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:06.7418463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:06.7419028Z     return fn(*args, **kwargs)
2025-05-07T20:32:06.7419685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:06.7420470Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:06.7421014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:06.7421699Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:06.7422359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:06.7422892Z     kernel = self.compile(
2025-05-07T20:32:06.7423435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:06.7424097Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.7424490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:06.7424728Z
2025-05-07T20:32:06.7424936Z self = <...>
2025-05-07T20:32:06.7426021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:06.7427410Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f07bcfa68c0>}
2025-05-07T20:32:06.7428739Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:06.7429775Z context = <...>
2025-05-07T20:32:06.7430067Z
2025-05-07T20:32:06.7430239Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:06.7430761Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.7431235Z                            module_map=module_map)
2025-05-07T20:32:06.7431609Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.7431967Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:06.7432223Z E       ^
2025-05-07T20:32:06.7432693Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
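Both failing kernels die at the same point because both cast their output to fp8e4nv. The name correspondence and a fallback policy, as an illustrative sketch (the mapping follows the error message and the standard torch/Triton correspondence; the fallback is our illustration, not FBGEMM behavior):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # torch.float8_e4m3fn <-> triton 'fp8e4nv' (rejected here: needs sm_89+)
        # torch.float8_e5m2   <-> triton 'fp8e5'   (listed as supported in the error)
        if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2  # wider exponent, less mantissa: pre-Ada fallback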
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.7433144Z 2025-05-07T20:32:06.7433567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:06.7434079Z 2025-05-07T20:32:06.7434191Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:06.7434598Z self=, 2025-05-07T20:32:06.7435000Z T=1, 2025-05-07T20:32:06.7435187Z D=5120, 2025-05-07T20:32:06.7435379Z scale_ub=None, 2025-05-07T20:32:06.7435599Z contiguous=False, 2025-05-07T20:32:06.7435935Z compiled=True, 2025-05-07T20:32:06.7436144Z ) 2025-05-07T20:32:06.8114192Z self = 2025-05-07T20:32:06.8114719Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:06.8114989Z 2025-05-07T20:32:06.8115065Z @given( 2025-05-07T20:32:06.8115303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:06.8115616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:06.8115928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:06.8116261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:06.8116587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:06.8116876Z ) 2025-05-07T20:32:06.8117231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:06.8117684Z def test_silu_mul_quant( 2025-05-07T20:32:06.8117926Z self, 2025-05-07T20:32:06.8118128Z T: int, 2025-05-07T20:32:06.8118339Z D: int, 2025-05-07T20:32:06.8118558Z scale_ub: Optional[float], 2025-05-07T20:32:06.8118841Z contiguous: bool, 2025-05-07T20:32:06.8119088Z compiled: bool, 2025-05-07T20:32:06.8119311Z ) -> None: 2025-05-07T20:32:06.8119531Z torch.manual_seed(2025) 2025-05-07T20:32:06.8119772Z 2025-05-07T20:32:06.8120040Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:06.8120381Z 2025-05-07T20:32:06.8120578Z x_sign = torch.sign(x) 2025-05-07T20:32:06.8120868Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:06.8121181Z x = x_sign * x_clamp 2025-05-07T20:32:06.8121426Z x0 = x[:, :D] 2025-05-07T20:32:06.8121640Z x1 = x[:, D:] 2025-05-07T20:32:06.8121853Z 2025-05-07T20:32:06.8122043Z if contiguous: 2025-05-07T20:32:06.8122277Z x0 = x0.contiguous() 2025-05-07T20:32:06.8122539Z x1 = x1.contiguous() 2025-05-07T20:32:06.8122790Z 2025-05-07T20:32:06.8122987Z if scale_ub is not None: 2025-05-07T20:32:06.8123257Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:06.8123601Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:06.8123912Z ) 2025-05-07T20:32:06.8124103Z else: 2025-05-07T20:32:06.8124320Z scale_ub_tensor = None 2025-05-07T20:32:06.8124574Z 2025-05-07T20:32:06.8124802Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:06.8125118Z op = silu_mul_quant 2025-05-07T20:32:06.8125375Z if compiled: 2025-05-07T20:32:06.8125623Z op = torch.compile(op) 2025-05-07T20:32:06.8125924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:06.8126203Z 2025-05-07T20:32:06.8126394Z y_fp8, y_scale = fn() 2025-05-07T20:32:06.8126681Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:06.8126980Z 2025-05-07T20:32:06.8127219Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:06.8127553Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:06.8127858Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:06.8128178Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:06.8128537Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:06.8128850Z 2025-05-07T20:32:06.8129055Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:06.8129251Z 2025-05-07T20:32:06.8129352Z moe/activation_test.py:126: 2025-05-07T20:32:06.8129654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:06.8129988Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:06.8130319Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:06.8131105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:06.8132066Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:06.8132700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:06.8133379Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:06.8134070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:06.8134793Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:06.8135558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:06.8136298Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:06.8137027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:06.8137684Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:06.8138301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:06.8138814Z fn() 2025-05-07T20:32:06.8139330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:06.8140015Z self.fn.run( 2025-05-07T20:32:06.8140506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:06.8141041Z kernel = self.compile( 2025-05-07T20:32:06.8149580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:06.8150267Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.8150676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:06.8150909Z 2025-05-07T20:32:06.8151129Z self = 2025-05-07T20:32:06.8152224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:06.8153648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f07bc8a7880>} 2025-05-07T20:32:06.8155000Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:06.8156043Z context = 2025-05-07T20:32:06.8156332Z 2025-05-07T20:32:06.8156503Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:06.8157039Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.8157506Z module_map=module_map) 2025-05-07T20:32:06.8157873Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.8158241Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:06.8158516Z E ^ 2025-05-07T20:32:06.8158990Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.8159438Z 2025-05-07T20:32:06.8159867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:06.8160395Z 2025-05-07T20:32:06.8160503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:06.8160927Z self=, 2025-05-07T20:32:06.8161333Z T=1, 2025-05-07T20:32:06.8161519Z D=5120, 2025-05-07T20:32:06.8161859Z scale_ub=None, 2025-05-07T20:32:06.8162087Z contiguous=True, 2025-05-07T20:32:06.8162312Z compiled=False, 2025-05-07T20:32:06.8162664Z ) 2025-05-07T20:32:07.1606876Z self = 2025-05-07T20:32:07.1607422Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:07.1607685Z 2025-05-07T20:32:07.1607775Z @given( 2025-05-07T20:32:07.1608009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.1608333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.1608649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.1608989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.1609324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.1609613Z ) 2025-05-07T20:32:07.1609969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.1610436Z def test_silu_mul_quant( 2025-05-07T20:32:07.1610682Z self, 2025-05-07T20:32:07.1610884Z T: int, 2025-05-07T20:32:07.1611080Z D: int, 2025-05-07T20:32:07.1611317Z scale_ub: Optional[float], 2025-05-07T20:32:07.1611595Z contiguous: bool, 2025-05-07T20:32:07.1611835Z compiled: bool, 2025-05-07T20:32:07.1612070Z ) -> None: 2025-05-07T20:32:07.1612297Z torch.manual_seed(2025) 2025-05-07T20:32:07.1612536Z 2025-05-07T20:32:07.1612816Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.1613166Z 2025-05-07T20:32:07.1613360Z x_sign = torch.sign(x) 2025-05-07T20:32:07.1613662Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.1613976Z x = x_sign * x_clamp 2025-05-07T20:32:07.1614227Z x0 = x[:, :D] 2025-05-07T20:32:07.1614443Z x1 = x[:, D:] 2025-05-07T20:32:07.1614657Z 2025-05-07T20:32:07.1614856Z if contiguous: 2025-05-07T20:32:07.1615097Z x0 = x0.contiguous() 2025-05-07T20:32:07.1615365Z x1 = x1.contiguous() 2025-05-07T20:32:07.1615605Z 2025-05-07T20:32:07.1615800Z if scale_ub is not None: 2025-05-07T20:32:07.1616081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.1616425Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.1616729Z ) 2025-05-07T20:32:07.1616929Z else: 2025-05-07T20:32:07.1617152Z scale_ub_tensor = None 2025-05-07T20:32:07.1617415Z 2025-05-07T20:32:07.1617650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.1617959Z op = silu_mul_quant 2025-05-07T20:32:07.1618218Z if compiled: 2025-05-07T20:32:07.1618470Z 
op = torch.compile(op) 2025-05-07T20:32:07.1618766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.1619035Z 2025-05-07T20:32:07.1619233Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.1619397Z 2025-05-07T20:32:07.1619506Z moe/activation_test.py:117: 2025-05-07T20:32:07.1619898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.1620236Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.1620525Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.1621215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.1621911Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:07.1622450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.1623132Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.1623787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.1624325Z kernel = self.compile( 2025-05-07T20:32:07.1624868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.1626031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.1626432Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.1626664Z 2025-05-07T20:32:07.1626872Z self = 2025-05-07T20:32:07.1627953Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.1629323Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc8a6b00>} 2025-05-07T20:32:07.1630646Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.1631690Z context = 2025-05-07T20:32:07.1631979Z 2025-05-07T20:32:07.1632151Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.1632672Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.1633143Z module_map=module_map) 2025-05-07T20:32:07.1633514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.1633866Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.1634128Z E ^ 2025-05-07T20:32:07.1634591Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.1636293Z 2025-05-07T20:32:07.1636715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.1637231Z 2025-05-07T20:32:07.1637342Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.1637756Z self=, 2025-05-07T20:32:07.1638154Z T=128, 2025-05-07T20:32:07.1638344Z D=5120, 2025-05-07T20:32:07.1638532Z scale_ub=None, 2025-05-07T20:32:07.1638752Z contiguous=False, 2025-05-07T20:32:07.1638981Z compiled=True, 2025-05-07T20:32:07.1639185Z ) 2025-05-07T20:32:07.1639509Z self = 2025-05-07T20:32:07.1640018Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:07.1640294Z 2025-05-07T20:32:07.1640375Z @given( 2025-05-07T20:32:07.1640612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.1640918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.1641226Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.1641562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.1641887Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.1642172Z ) 2025-05-07T20:32:07.1642525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.1642961Z def test_silu_mul_quant( 2025-05-07T20:32:07.1643201Z self, 2025-05-07T20:32:07.1643398Z T: int, 2025-05-07T20:32:07.1643593Z D: int, 2025-05-07T20:32:07.1643814Z scale_ub: Optional[float], 2025-05-07T20:32:07.1644090Z contiguous: bool, 2025-05-07T20:32:07.1644331Z compiled: bool, 2025-05-07T20:32:07.1644551Z ) -> None: 2025-05-07T20:32:07.1644772Z torch.manual_seed(2025) 2025-05-07T20:32:07.1645014Z 2025-05-07T20:32:07.1645283Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.1645624Z 2025-05-07T20:32:07.1645820Z x_sign = torch.sign(x) 2025-05-07T20:32:07.1646201Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.1646511Z x = x_sign * x_clamp 2025-05-07T20:32:07.1646830Z x0 = x[:, :D] 2025-05-07T20:32:07.1647047Z x1 = x[:, D:] 2025-05-07T20:32:07.1647256Z 2025-05-07T20:32:07.1647442Z if contiguous: 2025-05-07T20:32:07.1647668Z x0 = x0.contiguous() 2025-05-07T20:32:07.1647927Z x1 = x1.contiguous() 2025-05-07T20:32:07.1648171Z 2025-05-07T20:32:07.1648360Z if scale_ub is not None: 2025-05-07T20:32:07.1648632Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.1648966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.1649280Z ) 2025-05-07T20:32:07.1649467Z else: 2025-05-07T20:32:07.1649681Z scale_ub_tensor = None 2025-05-07T20:32:07.1649934Z 2025-05-07T20:32:07.1650162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.1650477Z op = silu_mul_quant 2025-05-07T20:32:07.1650741Z if compiled: 2025-05-07T20:32:07.1650993Z op = torch.compile(op) 2025-05-07T20:32:07.1651301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.1651576Z 2025-05-07T20:32:07.1651768Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.1651941Z 2025-05-07T20:32:07.1652044Z moe/activation_test.py:117: 2025-05-07T20:32:07.1652344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.1652672Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.1652960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.1653518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:07.1654087Z return fn(*args, **kwargs) 
2025-05-07T20:32:07.1654746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.1655442Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:07.1655985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.1656660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.1657320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.1657850Z kernel = self.compile( 2025-05-07T20:32:07.1658391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.1659042Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.1659436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.1659662Z 2025-05-07T20:32:07.1659970Z self = 2025-05-07T20:32:07.1661043Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.1662419Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc56b370>} 2025-05-07T20:32:07.1663752Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.1664767Z context = 2025-05-07T20:32:07.1665055Z 2025-05-07T20:32:07.1665230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.1665745Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.1666306Z module_map=module_map) 2025-05-07T20:32:07.1666669Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.1667132Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.1667394Z E ^ 2025-05-07T20:32:07.1667860Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.1668305Z 2025-05-07T20:32:07.1668723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.1669228Z 2025-05-07T20:32:07.1669338Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.1669746Z self=, 2025-05-07T20:32:07.1670146Z T=128, 2025-05-07T20:32:07.1670340Z D=7168, 2025-05-07T20:32:07.1670528Z scale_ub=1200.0, 2025-05-07T20:32:07.1670756Z contiguous=False, 2025-05-07T20:32:07.1670988Z compiled=False, 2025-05-07T20:32:07.1671201Z ) 2025-05-07T20:32:07.2942602Z self = 2025-05-07T20:32:07.2943200Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:07.2943475Z 2025-05-07T20:32:07.2943560Z @given( 2025-05-07T20:32:07.2943790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.2944104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.2944413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.2944740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.2945075Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.2945363Z ) 2025-05-07T20:32:07.2945705Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.2946158Z def test_silu_mul_quant( 2025-05-07T20:32:07.2946401Z self, 2025-05-07T20:32:07.2946601Z T: int, 2025-05-07T20:32:07.2946805Z D: int, 2025-05-07T20:32:07.2947030Z scale_ub: Optional[float], 2025-05-07T20:32:07.2947301Z contiguous: bool, 2025-05-07T20:32:07.2947546Z compiled: bool, 2025-05-07T20:32:07.2947773Z ) -> None: 2025-05-07T20:32:07.2947992Z torch.manual_seed(2025) 2025-05-07T20:32:07.2948231Z 2025-05-07T20:32:07.2948504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.2948843Z 2025-05-07T20:32:07.2949034Z x_sign = torch.sign(x) 2025-05-07T20:32:07.2949330Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.2949640Z x = x_sign * x_clamp 2025-05-07T20:32:07.2949902Z x0 = x[:, :D] 2025-05-07T20:32:07.2950122Z x1 = x[:, D:] 2025-05-07T20:32:07.2950323Z 2025-05-07T20:32:07.2950508Z if contiguous: 2025-05-07T20:32:07.2950743Z x0 = x0.contiguous() 2025-05-07T20:32:07.2950995Z x1 = x1.contiguous() 2025-05-07T20:32:07.2951235Z 2025-05-07T20:32:07.2951435Z if scale_ub is not None: 2025-05-07T20:32:07.2951702Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.2952042Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.2952354Z ) 2025-05-07T20:32:07.2952547Z else: 2025-05-07T20:32:07.2952767Z scale_ub_tensor = None 2025-05-07T20:32:07.2953015Z 2025-05-07T20:32:07.2953249Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.2953553Z op = silu_mul_quant 2025-05-07T20:32:07.2953805Z if compiled: 2025-05-07T20:32:07.2954052Z op = torch.compile(op) 2025-05-07T20:32:07.2954345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.2954619Z 2025-05-07T20:32:07.2954818Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.2954984Z 2025-05-07T20:32:07.2955082Z moe/activation_test.py:117: 2025-05-07T20:32:07.2955377Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.2956007Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.2956285Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.2957092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.2957797Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:07.2958337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.2959011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.2959673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.2960210Z kernel = self.compile( 2025-05-07T20:32:07.2960756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.2961404Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.2961811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.2962041Z 2025-05-07T20:32:07.2962258Z self = 2025-05-07T20:32:07.2963332Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.2964689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc56a560>} 2025-05-07T20:32:07.2966034Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.2967054Z context = 2025-05-07T20:32:07.2967337Z 2025-05-07T20:32:07.2967515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.2968031Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.2968505Z module_map=module_map) 2025-05-07T20:32:07.2968872Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.2969227Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.2969487Z E ^ 2025-05-07T20:32:07.2969955Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.2970399Z 2025-05-07T20:32:07.2970823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.2971340Z 2025-05-07T20:32:07.2971456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.2971865Z self=, 2025-05-07T20:32:07.2972269Z T=128, 2025-05-07T20:32:07.2972460Z D=5120, 2025-05-07T20:32:07.2972652Z scale_ub=None, 2025-05-07T20:32:07.2972873Z contiguous=False, 2025-05-07T20:32:07.2973107Z compiled=False, 2025-05-07T20:32:07.2973307Z ) 2025-05-07T20:32:07.2973628Z self = 2025-05-07T20:32:07.2974119Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:07.2974384Z 2025-05-07T20:32:07.2974462Z @given( 2025-05-07T20:32:07.2974695Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.2975011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.2975318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.2975650Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.2975982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.2976362Z ) 2025-05-07T20:32:07.2976784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.2977235Z def test_silu_mul_quant( 2025-05-07T20:32:07.2977483Z self, 2025-05-07T20:32:07.2977673Z T: int, 2025-05-07T20:32:07.2977873Z D: int, 2025-05-07T20:32:07.2978095Z scale_ub: Optional[float], 2025-05-07T20:32:07.2978360Z contiguous: bool, 2025-05-07T20:32:07.2978604Z compiled: bool, 2025-05-07T20:32:07.2978832Z ) -> None: 2025-05-07T20:32:07.2979044Z torch.manual_seed(2025) 2025-05-07T20:32:07.2979288Z 2025-05-07T20:32:07.2979563Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.2979997Z 2025-05-07T20:32:07.2980189Z x_sign = torch.sign(x) 2025-05-07T20:32:07.2980483Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.2980790Z x = x_sign * x_clamp 2025-05-07T20:32:07.2981033Z x0 = x[:, :D] 2025-05-07T20:32:07.2981248Z x1 = x[:, D:] 2025-05-07T20:32:07.2981461Z 2025-05-07T20:32:07.2981646Z if contiguous: 2025-05-07T20:32:07.2981884Z x0 = x0.contiguous() 2025-05-07T20:32:07.2982142Z x1 = x1.contiguous() 2025-05-07T20:32:07.2982378Z 2025-05-07T20:32:07.2982583Z if scale_ub is not None: 2025-05-07T20:32:07.2982907Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.2983236Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.2983550Z ) 2025-05-07T20:32:07.2983746Z else: 2025-05-07T20:32:07.2983959Z scale_ub_tensor = None 2025-05-07T20:32:07.2984211Z 2025-05-07T20:32:07.2984444Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.2984754Z op = silu_mul_quant 2025-05-07T20:32:07.2985010Z if compiled: 2025-05-07T20:32:07.2985262Z op = torch.compile(op) 2025-05-07T20:32:07.2985569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.2985836Z 2025-05-07T20:32:07.2986038Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.2986205Z 2025-05-07T20:32:07.2986312Z moe/activation_test.py:117: 2025-05-07T20:32:07.2986608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.2986939Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.2987227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.2987904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.2988593Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:07.2989127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.2989802Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.2990742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.2991282Z kernel = self.compile( 2025-05-07T20:32:07.2991824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.2992476Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.2992864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.2993097Z 2025-05-07T20:32:07.2993303Z self = 2025-05-07T20:32:07.2994367Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.2995722Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc568550>} 2025-05-07T20:32:07.2997291Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.2998328Z context = 2025-05-07T20:32:07.2998618Z 2025-05-07T20:32:07.2998783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.2999302Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.2999760Z module_map=module_map) 2025-05-07T20:32:07.3000128Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.3000482Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.3000738Z E ^ 2025-05-07T20:32:07.3001204Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.3001659Z 2025-05-07T20:32:07.3002086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.3002595Z 2025-05-07T20:32:07.3002706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.3003110Z self=, 2025-05-07T20:32:07.3003507Z T=128, 2025-05-07T20:32:07.3003695Z D=5120, 2025-05-07T20:32:07.3003883Z scale_ub=1200.0, 2025-05-07T20:32:07.3004111Z contiguous=True, 2025-05-07T20:32:07.3004341Z compiled=False, 2025-05-07T20:32:07.3004546Z ) 2025-05-07T20:32:07.4946503Z self = 2025-05-07T20:32:07.4947414Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:07.4947890Z 2025-05-07T20:32:07.4948025Z @given( 2025-05-07T20:32:07.4948442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.4948961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.4949463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.4950012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.4950562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.4951061Z ) 2025-05-07T20:32:07.4951619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.4952302Z def test_silu_mul_quant( 2025-05-07T20:32:07.4952668Z self, 2025-05-07T20:32:07.4952979Z T: int, 2025-05-07T20:32:07.4953284Z D: int, 2025-05-07T20:32:07.4953623Z scale_ub: Optional[float], 2025-05-07T20:32:07.4954069Z contiguous: bool, 2025-05-07T20:32:07.4954445Z compiled: bool, 2025-05-07T20:32:07.4954805Z ) -> None: 2025-05-07T20:32:07.4955175Z torch.manual_seed(2025) 2025-05-07T20:32:07.4966365Z 2025-05-07T20:32:07.4966856Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.4967437Z 2025-05-07T20:32:07.4967772Z x_sign = torch.sign(x) 2025-05-07T20:32:07.4968255Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.4968764Z x = x_sign * x_clamp 2025-05-07T20:32:07.4969165Z x0 = x[:, :D] 2025-05-07T20:32:07.4969518Z x1 = x[:, D:] 2025-05-07T20:32:07.4969840Z 2025-05-07T20:32:07.4970142Z if contiguous: 2025-05-07T20:32:07.4970520Z x0 = x0.contiguous() 2025-05-07T20:32:07.4970943Z x1 = x1.contiguous() 2025-05-07T20:32:07.4971349Z 2025-05-07T20:32:07.4971663Z if scale_ub is not None: 2025-05-07T20:32:07.4972107Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.4972685Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.4973219Z ) 2025-05-07T20:32:07.4973539Z else: 2025-05-07T20:32:07.4974330Z scale_ub_tensor = None 2025-05-07T20:32:07.4974759Z 2025-05-07T20:32:07.4975153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.4975876Z op = silu_mul_quant 2025-05-07T20:32:07.4976296Z if compiled: 2025-05-07T20:32:07.4976719Z op = torch.compile(op) 2025-05-07T20:32:07.4977216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.4977677Z 2025-05-07T20:32:07.4977992Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.4978262Z 2025-05-07T20:32:07.4978421Z moe/activation_test.py:117: 2025-05-07T20:32:07.4978915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.4979460Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.4980031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.4981182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.4982403Z 
_fbgemm_silu_mul_quant[grid](
[... identical compilation traceback as above: triton/runtime/jit.py:330 -> triton/runtime/jit.py:623 -> triton/compiler/compiler.py:273 -> src.make_ir(options, codegen_fns, module_map, context) ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body; identical CompilationError traceback as above]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; identical CompilationError traceback as above]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
This example gets one step further: y_fp8, y_scale = fn() succeeds, and the failure moves into the reference path:

    y_fp8, y_scale = fn()
    y = y_fp8.to(torch.float32) * y_scale[:, None]

    def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
        return triton_quantize_fp8_row(y, scale_ub_tensor)

>   y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[... autotuner benchmarking path: triton/runtime/autotuner.py:186 -> autotuner.py:166 -> triton/testing.py:117 (do_bench) -> autotuner.py:152 -> jit.py:623 -> compiler.py:273 ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; identical CompilationError traceback in _fbgemm_silu_mul_quant]
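Every sampled configuration dies at the same point: Triton's AST-to-TTIR lowering of a kernel that uses fp8e4nv, which the NVIDIA backend on this GPU does not expose (only fp8e4b15 and fp8e5 are listed). A minimal guard sketch follows, assuming fp8e4nv only compiles on devices with compute capability 8.9 or newer; the helper and marker names are hypothetical, not part of the FBGEMM test suite.

    import pytest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (e4m3) lowers only on SM 8.9+ parts;
        # older GPUs get exactly the ValueError seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker that could wrap test_silu_mul_quant:
    requires_fp8e4nv = pytest.mark.skipif(
        not _supports_fp8e4nv(),
        reason="Triton reports fp8e4nv unsupported; only fp8e4b15/fp8e5 available",
    )

With such a guard, the hypothesis sweep would be skipped once instead of re-raising the same CompilationError for every example below.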
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body; identical CompilationError traceback as above]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; identical CompilationError traceback as above]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; identical CompilationError traceback as above]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[same test body; identical CompilationError traceback as above]
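The failure is clearly independent of T, D, scale_ub, contiguity, and torch.compile, since the kernel never gets past compilation. A standalone repro sketch, assuming only stock Triton and PyTorch (this kernel is hypothetical, not FBGEMM's): casting to tl.float8e4nv inside any @triton.jit kernel should raise the same ValueError at compile time on a pre-SM-8.9 device.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below is what trips src.make_ir() on unsupported architectures.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda")
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8_kernel[(1,)](x, y, 128, BLOCK=128)  # expect CompilationError on SM < 8.9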
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.3552143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.3553348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.3554514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.3555457Z kernel = self.compile( 2025-05-07T20:32:08.3556402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.3557577Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.3558258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3558655Z 2025-05-07T20:32:08.3559014Z self = 2025-05-07T20:32:08.3561055Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.3563577Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe945e0>} 2025-05-07T20:32:08.3566018Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.3567997Z context = 2025-05-07T20:32:08.3568513Z 2025-05-07T20:32:08.3568789Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.3569705Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.3570523Z module_map=module_map) 2025-05-07T20:32:08.3571142Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.3571730Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.3572165Z E ^ 2025-05-07T20:32:08.3572979Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.3573790Z 2025-05-07T20:32:08.3574531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.3575476Z 2025-05-07T20:32:08.3575644Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.3576360Z self=, 2025-05-07T20:32:08.3577050Z T=4096, 2025-05-07T20:32:08.3577347Z D=7168, 2025-05-07T20:32:08.3577660Z scale_ub=1200.0, 2025-05-07T20:32:08.3578028Z contiguous=False, 2025-05-07T20:32:08.3578425Z compiled=False, 2025-05-07T20:32:08.3578753Z ) 2025-05-07T20:32:08.3579288Z self = 2025-05-07T20:32:08.3590744Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:08.3591185Z 2025-05-07T20:32:08.3591295Z @given( 2025-05-07T20:32:08.3591612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.3592067Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.3592533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.3593029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.3593498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.3593906Z ) 2025-05-07T20:32:08.3594422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.3595078Z def test_silu_mul_quant( 2025-05-07T20:32:08.3595420Z self, 2025-05-07T20:32:08.3595701Z T: int, 2025-05-07T20:32:08.3595983Z D: int, 2025-05-07T20:32:08.3596298Z scale_ub: Optional[float], 2025-05-07T20:32:08.3596707Z contiguous: bool, 2025-05-07T20:32:08.3597070Z compiled: bool, 2025-05-07T20:32:08.3597388Z ) -> None: 2025-05-07T20:32:08.3597693Z torch.manual_seed(2025) 2025-05-07T20:32:08.3598042Z 2025-05-07T20:32:08.3598427Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.3598929Z 2025-05-07T20:32:08.3599210Z x_sign = torch.sign(x) 2025-05-07T20:32:08.3599647Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.3600148Z x = x_sign * x_clamp 2025-05-07T20:32:08.3600526Z x0 = x[:, :D] 2025-05-07T20:32:08.3600864Z x1 = x[:, D:] 2025-05-07T20:32:08.3601180Z 2025-05-07T20:32:08.3601471Z if contiguous: 2025-05-07T20:32:08.3601817Z x0 = x0.contiguous() 2025-05-07T20:32:08.3602225Z x1 = x1.contiguous() 2025-05-07T20:32:08.3602793Z 2025-05-07T20:32:08.3603106Z if scale_ub is not None: 2025-05-07T20:32:08.3603554Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.3604117Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.3604645Z ) 2025-05-07T20:32:08.3604952Z else: 2025-05-07T20:32:08.3605295Z scale_ub_tensor = None 2025-05-07T20:32:08.3605709Z 2025-05-07T20:32:08.3606062Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.3606576Z op = silu_mul_quant 2025-05-07T20:32:08.3607122Z if compiled: 2025-05-07T20:32:08.3607512Z op = torch.compile(op) 2025-05-07T20:32:08.3608158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.3608605Z 2025-05-07T20:32:08.3608903Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.3609165Z 2025-05-07T20:32:08.3609320Z moe/activation_test.py:117: 2025-05-07T20:32:08.3609807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3610379Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.3610845Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.3612097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:08.3613347Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:08.3614294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.3615522Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.3616708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.3617665Z     kernel = self.compile(
2025-05-07T20:32:08.3618612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.3619906Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.3620598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.3620998Z 
2025-05-07T20:32:08.3621356Z self =
2025-05-07T20:32:08.3623346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:08.3625871Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe94ca0>}
2025-05-07T20:32:08.3628320Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:08.3630169Z context =
2025-05-07T20:32:08.3630679Z 
2025-05-07T20:32:08.3630969Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.3631867Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.3632689Z                            module_map=module_map)
2025-05-07T20:32:08.3633305Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.3633895Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.3634344Z E   ^
2025-05-07T20:32:08.3635154Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.3635964Z 
2025-05-07T20:32:08.3636712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.3637646Z 
2025-05-07T20:32:08.3637912Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.3638622Z     self=,
2025-05-07T20:32:08.3639306Z     T=16384,
2025-05-07T20:32:08.3639615Z     D=7168,
2025-05-07T20:32:08.3639926Z     scale_ub=None,
2025-05-07T20:32:08.3640266Z     contiguous=True,
2025-05-07T20:32:08.3640627Z     compiled=True,
2025-05-07T20:32:08.3640960Z )
2025-05-07T20:32:08.5595046Z self =
2025-05-07T20:32:08.5595990Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:08.5596788Z 
2025-05-07T20:32:08.5596916Z     @given(
2025-05-07T20:32:08.5597498Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:08.5597999Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:08.5598478Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:08.5598955Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:08.5599498Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:08.5599982Z     )
2025-05-07T20:32:08.5600581Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:08.5601340Z     def test_silu_mul_quant(
2025-05-07T20:32:08.5601746Z         self,
2025-05-07T20:32:08.5602070Z         T: int,
2025-05-07T20:32:08.5602390Z         D: int,
2025-05-07T20:32:08.5602740Z         scale_ub: Optional[float],
2025-05-07T20:32:08.5603193Z         contiguous: bool,
2025-05-07T20:32:08.5603590Z         compiled: bool,
2025-05-07T20:32:08.5603956Z     ) -> None:
2025-05-07T20:32:08.5604309Z         torch.manual_seed(2025)
2025-05-07T20:32:08.5604712Z 
2025-05-07T20:32:08.5605165Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:08.5605745Z 
2025-05-07T20:32:08.5606058Z         x_sign = torch.sign(x)
2025-05-07T20:32:08.5606537Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:08.5607069Z         x = x_sign * x_clamp
2025-05-07T20:32:08.5607468Z         x0 = x[:, :D]
2025-05-07T20:32:08.5607816Z         x1 = x[:, D:]
2025-05-07T20:32:08.5608154Z 
2025-05-07T20:32:08.5608453Z         if contiguous:
2025-05-07T20:32:08.5608827Z             x0 = x0.contiguous()
2025-05-07T20:32:08.5609256Z             x1 = x1.contiguous()
2025-05-07T20:32:08.5609655Z 
2025-05-07T20:32:08.5609959Z         if scale_ub is not None:
2025-05-07T20:32:08.5610416Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:08.5610977Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:08.5611500Z             )
2025-05-07T20:32:08.5611802Z         else:
2025-05-07T20:32:08.5612151Z             scale_ub_tensor = None
2025-05-07T20:32:08.5612580Z 
2025-05-07T20:32:08.5612950Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:08.5613481Z             op = silu_mul_quant
2025-05-07T20:32:08.5613892Z             if compiled:
2025-05-07T20:32:08.5614294Z                 op = torch.compile(op)
2025-05-07T20:32:08.5614795Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.5615261Z 
2025-05-07T20:32:08.5615565Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.5615854Z 
2025-05-07T20:32:08.5616014Z moe/activation_test.py:117: 
2025-05-07T20:32:08.5616510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.5617071Z moe/activation_test.py:115: in fn
2025-05-07T20:32:08.5617533Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.5618511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:08.5619503Z     return fn(*args, **kwargs)
2025-05-07T20:32:08.5620756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:08.5621960Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:08.5622850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.5624197Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.5625326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.5626236Z     kernel = self.compile(
2025-05-07T20:32:08.5627192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.5628351Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.5629120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.5629634Z 
2025-05-07T20:32:08.5629984Z self =
2025-05-07T20:32:08.5631927Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:08.5634487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe95b40>}
2025-05-07T20:32:08.5636928Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:08.5638787Z context =
2025-05-07T20:32:08.5639288Z 
2025-05-07T20:32:08.5639566Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.5640437Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.5641240Z                            module_map=module_map)
2025-05-07T20:32:08.5641866Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.5642463Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.5642888Z E   ^
2025-05-07T20:32:08.5643697Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.5644508Z 
2025-05-07T20:32:08.5645260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
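The failure is architectural, not a logic bug in the test: silu_mul_quant quantizes its output to Triton's fp8e4nv (the e4m3 FP8 format), and Triton refuses to lower that type for this runner's GPU, which only exposes fp8e4b15 and fp8e5. Compilation therefore dies in make_ir before the kernel ever launches, once per Hypothesis example. NVIDIA hardware gained e4m3 support at compute capability 8.9 (Ada) and 9.0 (Hopper), so a capability gate would skip these cases cleanly on older parts. A minimal sketch follows; the helper and the unittest-style class name are illustrative, not code from moe/activation_test.py.

# Hypothetical capability gate, a sketch rather than FBGEMM code.
# Assumes Triton's fp8e4nv requires an e4m3-capable GPU (SM 8.9 or newer).
import unittest

import torch


def _cuda_supports_fp8e4nv() -> bool:
    # get_device_capability() returns (major, minor); tuple comparison
    # expresses "SM 8.9 or newer". Older parts hit the CompilationError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(_cuda_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class ActivationTests(unittest.TestCase):  # class name assumed for illustration
    ...

With such a gate the suite would report one skip with a clear reason instead of compiling and failing every sampled example against the same compiler error.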
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5644508Z 2025-05-07T20:32:08.5645260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.5646180Z 2025-05-07T20:32:08.5646356Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.5647064Z self=, 2025-05-07T20:32:08.5647758Z T=4096, 2025-05-07T20:32:08.5648073Z D=5120, 2025-05-07T20:32:08.5648377Z scale_ub=None, 2025-05-07T20:32:08.5648728Z contiguous=False, 2025-05-07T20:32:08.5649100Z compiled=True, 2025-05-07T20:32:08.5649431Z ) 2025-05-07T20:32:08.5649970Z self = 2025-05-07T20:32:08.5650830Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:08.5651298Z 2025-05-07T20:32:08.5651431Z @given( 2025-05-07T20:32:08.5651803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.5652330Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.5652850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.5653401Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.5653965Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.5654451Z ) 2025-05-07T20:32:08.5655049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.5655823Z def test_silu_mul_quant( 2025-05-07T20:32:08.5656227Z self, 2025-05-07T20:32:08.5656533Z T: int, 2025-05-07T20:32:08.5656855Z D: int, 2025-05-07T20:32:08.5657213Z scale_ub: Optional[float], 2025-05-07T20:32:08.5657755Z contiguous: bool, 2025-05-07T20:32:08.5658154Z compiled: bool, 2025-05-07T20:32:08.5658522Z ) -> None: 2025-05-07T20:32:08.5658872Z torch.manual_seed(2025) 2025-05-07T20:32:08.5659254Z 2025-05-07T20:32:08.5659677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.5660255Z 2025-05-07T20:32:08.5660498Z x_sign = torch.sign(x) 2025-05-07T20:32:08.5660892Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.5661324Z x = x_sign * x_clamp 2025-05-07T20:32:08.5661737Z x0 = x[:, :D] 2025-05-07T20:32:08.5662024Z x1 = x[:, D:] 2025-05-07T20:32:08.5662300Z 2025-05-07T20:32:08.5662692Z if contiguous: 2025-05-07T20:32:08.5663044Z x0 = x0.contiguous() 2025-05-07T20:32:08.5663433Z x1 = x1.contiguous() 2025-05-07T20:32:08.5663759Z 2025-05-07T20:32:08.5664023Z if scale_ub is not None: 2025-05-07T20:32:08.5664420Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.5664905Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.5665364Z ) 2025-05-07T20:32:08.5665643Z else: 2025-05-07T20:32:08.5665942Z scale_ub_tensor = None 2025-05-07T20:32:08.5666293Z 2025-05-07T20:32:08.5666635Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.5667106Z op = silu_mul_quant 2025-05-07T20:32:08.5667466Z if compiled: 2025-05-07T20:32:08.5667814Z op = torch.compile(op) 2025-05-07T20:32:08.5668235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5668623Z 2025-05-07T20:32:08.5668902Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.5669166Z 2025-05-07T20:32:08.5669311Z moe/activation_test.py:117: 2025-05-07T20:32:08.5669746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5670216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.5670648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5671586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.5672528Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.5673650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.5674825Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.5675715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.5676872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.5677999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.5678896Z kernel = self.compile( 2025-05-07T20:32:08.5679791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.5680829Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.5681457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5681822Z 2025-05-07T20:32:08.5682162Z self = 2025-05-07T20:32:08.5683978Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.5686429Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe95240>} 2025-05-07T20:32:08.5688791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.5691093Z context = 2025-05-07T20:32:08.5691588Z 2025-05-07T20:32:08.5691864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.5692758Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.5693601Z module_map=module_map) 2025-05-07T20:32:08.5694189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.5694932Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.5695376Z E ^ 2025-05-07T20:32:08.5696358Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5697171Z 2025-05-07T20:32:08.5697920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.5698841Z 2025-05-07T20:32:08.9227839Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.9228573Z self=, 2025-05-07T20:32:08.9229064Z T=4096, 2025-05-07T20:32:08.9229265Z D=5120, 2025-05-07T20:32:08.9229469Z scale_ub=1200.0, 2025-05-07T20:32:08.9229702Z contiguous=False, 2025-05-07T20:32:08.9229938Z compiled=False, 2025-05-07T20:32:08.9230153Z ) 2025-05-07T20:32:08.9230483Z self = 2025-05-07T20:32:08.9231014Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:08.9231293Z 2025-05-07T20:32:08.9231391Z @given( 2025-05-07T20:32:08.9231625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.9231945Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.9232255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.9232590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.9232916Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.9233207Z ) 2025-05-07T20:32:08.9233566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.9234002Z def test_silu_mul_quant( 2025-05-07T20:32:08.9234247Z self, 2025-05-07T20:32:08.9234445Z T: int, 2025-05-07T20:32:08.9234641Z D: int, 2025-05-07T20:32:08.9234869Z scale_ub: Optional[float], 2025-05-07T20:32:08.9235141Z contiguous: bool, 2025-05-07T20:32:08.9235385Z compiled: bool, 2025-05-07T20:32:08.9235616Z ) -> None: 2025-05-07T20:32:08.9235841Z torch.manual_seed(2025) 2025-05-07T20:32:08.9236084Z 2025-05-07T20:32:08.9236360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.9236707Z 2025-05-07T20:32:08.9236907Z x_sign = torch.sign(x) 2025-05-07T20:32:08.9237195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.9237508Z x = x_sign * x_clamp 2025-05-07T20:32:08.9237752Z x0 = x[:, :D] 2025-05-07T20:32:08.9237967Z x1 = x[:, D:] 2025-05-07T20:32:08.9238177Z 2025-05-07T20:32:08.9238363Z if contiguous: 2025-05-07T20:32:08.9238594Z x0 = x0.contiguous() 2025-05-07T20:32:08.9238859Z x1 = x1.contiguous() 2025-05-07T20:32:08.9239105Z 2025-05-07T20:32:08.9239299Z if scale_ub is not None: 2025-05-07T20:32:08.9239580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.9239931Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.9240238Z ) 2025-05-07T20:32:08.9240446Z else: 2025-05-07T20:32:08.9240664Z scale_ub_tensor = None 2025-05-07T20:32:08.9240916Z 2025-05-07T20:32:08.9241155Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.9241472Z op = silu_mul_quant 2025-05-07T20:32:08.9242037Z if compiled: 2025-05-07T20:32:08.9242285Z op = torch.compile(op) 2025-05-07T20:32:08.9242587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9242866Z 2025-05-07T20:32:08.9243056Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.9243230Z 2025-05-07T20:32:08.9243333Z moe/activation_test.py:117: 2025-05-07T20:32:08.9243633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.9243961Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.9244242Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9245167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:08.9245861Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.9246393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.9247075Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.9247739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.9248263Z kernel = self.compile( 2025-05-07T20:32:08.9248805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.9249463Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.9249859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.9250088Z 2025-05-07T20:32:08.9250295Z self = 2025-05-07T20:32:08.9251374Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.9252769Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe96cb0>} 2025-05-07T20:32:08.9254104Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.9255121Z context = 2025-05-07T20:32:08.9255407Z 2025-05-07T20:32:08.9255576Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.9256100Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.9256567Z module_map=module_map) 2025-05-07T20:32:08.9256929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.9257283Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.9257550Z E ^ 2025-05-07T20:32:08.9258011Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.9258462Z 2025-05-07T20:32:08.9258886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.9259399Z 2025-05-07T20:32:08.9259504Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.9260001Z self=, 2025-05-07T20:32:08.9260406Z T=4096, 2025-05-07T20:32:08.9260601Z D=5120, 2025-05-07T20:32:08.9260801Z scale_ub=1200.0, 2025-05-07T20:32:08.9261030Z contiguous=False, 2025-05-07T20:32:08.9261268Z compiled=True, 2025-05-07T20:32:08.9261477Z ) 2025-05-07T20:32:08.9261801Z self = 2025-05-07T20:32:08.9262294Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:08.9262632Z 2025-05-07T20:32:08.9262712Z @given( 2025-05-07T20:32:08.9262970Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.9263334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.9263645Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.9263968Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.9264303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.9264595Z ) 2025-05-07T20:32:08.9275016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.9275603Z def test_silu_mul_quant( 2025-05-07T20:32:08.9275859Z self, 2025-05-07T20:32:08.9276148Z T: int, 2025-05-07T20:32:08.9276353Z D: int, 2025-05-07T20:32:08.9276582Z scale_ub: Optional[float], 2025-05-07T20:32:08.9276861Z contiguous: bool, 2025-05-07T20:32:08.9277108Z compiled: bool, 2025-05-07T20:32:08.9277350Z ) -> None: 2025-05-07T20:32:08.9277578Z torch.manual_seed(2025) 2025-05-07T20:32:08.9277823Z 2025-05-07T20:32:08.9278111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.9278462Z 2025-05-07T20:32:08.9278661Z x_sign = torch.sign(x) 2025-05-07T20:32:08.9278964Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.9279281Z x = x_sign * x_clamp 2025-05-07T20:32:08.9279529Z x0 = x[:, :D] 2025-05-07T20:32:08.9279755Z x1 = x[:, D:] 2025-05-07T20:32:08.9279994Z 2025-05-07T20:32:08.9280195Z if contiguous: 2025-05-07T20:32:08.9280428Z x0 = x0.contiguous() 2025-05-07T20:32:08.9280708Z x1 = x1.contiguous() 2025-05-07T20:32:08.9280958Z 2025-05-07T20:32:08.9281154Z if scale_ub is not None: 2025-05-07T20:32:08.9281435Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.9281775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.9282090Z ) 2025-05-07T20:32:08.9282293Z else: 2025-05-07T20:32:08.9282516Z scale_ub_tensor = None 2025-05-07T20:32:08.9282779Z 2025-05-07T20:32:08.9283014Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.9283336Z op = silu_mul_quant 2025-05-07T20:32:08.9283593Z if compiled: 2025-05-07T20:32:08.9283841Z op = torch.compile(op) 2025-05-07T20:32:08.9284148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9284430Z 2025-05-07T20:32:08.9284624Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.9284795Z 2025-05-07T20:32:08.9284898Z moe/activation_test.py:117: 2025-05-07T20:32:08.9285207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.9285541Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.9285830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9286400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.9286965Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.9287618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.9288312Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.9288852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.9289527Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.9290535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.9291076Z kernel = self.compile( 2025-05-07T20:32:08.9291619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.9292264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.9292790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.9293022Z 2025-05-07T20:32:08.9293273Z self = 2025-05-07T20:32:08.9294351Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.9295709Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe96b90>} 2025-05-07T20:32:08.9297228Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.9298264Z context = 2025-05-07T20:32:08.9298550Z 2025-05-07T20:32:08.9298728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.9299240Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.9299705Z module_map=module_map) 2025-05-07T20:32:08.9300189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.9300552Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.9300808Z E ^ 2025-05-07T20:32:08.9301277Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.9301723Z 2025-05-07T20:32:08.9302150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.9302653Z 2025-05-07T20:32:09.0593713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0594445Z self=, 2025-05-07T20:32:09.0594993Z T=2048, 2025-05-07T20:32:09.0595188Z D=7168, 2025-05-07T20:32:09.0595393Z scale_ub=1200.0, 2025-05-07T20:32:09.0595632Z contiguous=False, 2025-05-07T20:32:09.0595863Z compiled=False, 2025-05-07T20:32:09.0596081Z ) 2025-05-07T20:32:09.0596415Z self = 2025-05-07T20:32:09.0596916Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:09.0597213Z 2025-05-07T20:32:09.0597295Z @given( 2025-05-07T20:32:09.0597540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0597868Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0598188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0598530Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0598868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0599153Z ) 2025-05-07T20:32:09.0599514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0599962Z def test_silu_mul_quant( 2025-05-07T20:32:09.0600211Z self, 2025-05-07T20:32:09.0600419Z T: int, 2025-05-07T20:32:09.0600626Z D: int, 2025-05-07T20:32:09.0600847Z scale_ub: Optional[float], 2025-05-07T20:32:09.0601130Z contiguous: bool, 2025-05-07T20:32:09.0601387Z compiled: bool, 2025-05-07T20:32:09.0601615Z ) -> None: 2025-05-07T20:32:09.0601846Z torch.manual_seed(2025) 2025-05-07T20:32:09.0602091Z 2025-05-07T20:32:09.0602368Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0602720Z 2025-05-07T20:32:09.0602919Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0603210Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0603523Z x = x_sign * x_clamp 2025-05-07T20:32:09.0604046Z x0 = x[:, :D] 2025-05-07T20:32:09.0604269Z x1 = x[:, D:] 2025-05-07T20:32:09.0604477Z 2025-05-07T20:32:09.0604673Z if contiguous: 2025-05-07T20:32:09.0604913Z x0 = x0.contiguous() 2025-05-07T20:32:09.0605172Z x1 = x1.contiguous() 2025-05-07T20:32:09.0605421Z 2025-05-07T20:32:09.0605622Z if scale_ub is not None: 2025-05-07T20:32:09.0605894Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0606235Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0606548Z ) 2025-05-07T20:32:09.0606833Z else: 2025-05-07T20:32:09.0607052Z scale_ub_tensor = None 2025-05-07T20:32:09.0607316Z 2025-05-07T20:32:09.0607690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0608011Z op = silu_mul_quant 2025-05-07T20:32:09.0608270Z if compiled: 2025-05-07T20:32:09.0608520Z op = torch.compile(op) 2025-05-07T20:32:09.0608828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0609110Z 2025-05-07T20:32:09.0609304Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0609481Z 2025-05-07T20:32:09.0609584Z moe/activation_test.py:117: 2025-05-07T20:32:09.0609887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0610224Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0610507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0611205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:09.0611903Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0612452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0613144Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0613817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0614358Z kernel = self.compile( 2025-05-07T20:32:09.0614900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0615559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0615959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0616188Z 2025-05-07T20:32:09.0616406Z self = 2025-05-07T20:32:09.0617489Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0618880Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5dc5e0>} 2025-05-07T20:32:09.0620308Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0621333Z context = 2025-05-07T20:32:09.0621623Z 2025-05-07T20:32:09.0621802Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0622328Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0622806Z module_map=module_map) 2025-05-07T20:32:09.0623211Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0623588Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0623852Z E ^ 2025-05-07T20:32:09.0624320Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0624831Z 2025-05-07T20:32:09.0625256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0625766Z 2025-05-07T20:32:09.0625871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0626294Z self=, 2025-05-07T20:32:09.0626700Z T=1, 2025-05-07T20:32:09.0626885Z D=7168, 2025-05-07T20:32:09.0627082Z scale_ub=None, 2025-05-07T20:32:09.0627352Z contiguous=True, 2025-05-07T20:32:09.0627577Z compiled=False, 2025-05-07T20:32:09.0627786Z ) 2025-05-07T20:32:09.0628189Z self = 2025-05-07T20:32:09.0628681Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:09.0628951Z 2025-05-07T20:32:09.0629035Z @given( 2025-05-07T20:32:09.0629273Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0629594Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0629903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0630237Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0630566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0630851Z ) 2025-05-07T20:32:09.0631209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0631658Z def test_silu_mul_quant( 2025-05-07T20:32:09.0631903Z self, 2025-05-07T20:32:09.0632110Z T: int, 2025-05-07T20:32:09.0632311Z D: int, 2025-05-07T20:32:09.0632535Z scale_ub: Optional[float], 2025-05-07T20:32:09.0632815Z contiguous: bool, 2025-05-07T20:32:09.0633066Z compiled: bool, 2025-05-07T20:32:09.0633298Z ) -> None: 2025-05-07T20:32:09.0633517Z torch.manual_seed(2025) 2025-05-07T20:32:09.0633768Z 2025-05-07T20:32:09.0634053Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0634396Z 2025-05-07T20:32:09.0634604Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0634908Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0635218Z x = x_sign * x_clamp 2025-05-07T20:32:09.0635472Z x0 = x[:, :D] 2025-05-07T20:32:09.0635698Z x1 = x[:, D:] 2025-05-07T20:32:09.0635906Z 2025-05-07T20:32:09.0636102Z if contiguous: 2025-05-07T20:32:09.0636342Z x0 = x0.contiguous() 2025-05-07T20:32:09.0636608Z x1 = x1.contiguous() 2025-05-07T20:32:09.0636858Z 2025-05-07T20:32:09.0637062Z if scale_ub is not None: 2025-05-07T20:32:09.0637340Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0637684Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0637994Z ) 2025-05-07T20:32:09.0638196Z else: 2025-05-07T20:32:09.0638409Z scale_ub_tensor = None 2025-05-07T20:32:09.0638675Z 2025-05-07T20:32:09.0638915Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0639230Z op = silu_mul_quant 2025-05-07T20:32:09.0639491Z if compiled: 2025-05-07T20:32:09.0639748Z op = torch.compile(op) 2025-05-07T20:32:09.0640046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0640328Z 2025-05-07T20:32:09.0640534Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0640702Z 2025-05-07T20:32:09.0640802Z moe/activation_test.py:117: 2025-05-07T20:32:09.0641107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0641445Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0641736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0642429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.0643125Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0643722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0644410Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0645079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0645615Z kernel = self.compile( 2025-05-07T20:32:09.0646162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0646870Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0647352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0647582Z 2025-05-07T20:32:09.0647802Z self = 2025-05-07T20:32:09.0648886Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0650244Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5dcd30>} 2025-05-07T20:32:09.0651587Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0652618Z context = 2025-05-07T20:32:09.0652913Z 2025-05-07T20:32:09.0653108Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0653672Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0654150Z module_map=module_map) 2025-05-07T20:32:09.0654528Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0654891Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0655154Z E ^ 2025-05-07T20:32:09.0655627Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0656073Z 2025-05-07T20:32:09.0656497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0657011Z 2025-05-07T20:32:09.0657118Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0657544Z self=, 2025-05-07T20:32:09.0657958Z T=16384, 2025-05-07T20:32:09.0658158Z D=7168, 2025-05-07T20:32:09.0658363Z scale_ub=1200.0, 2025-05-07T20:32:09.0658599Z contiguous=False, 2025-05-07T20:32:09.0658826Z compiled=True, 2025-05-07T20:32:09.3328847Z ) 2025-05-07T20:32:09.3329418Z self = 2025-05-07T20:32:09.3330126Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:09.3330441Z 2025-05-07T20:32:09.3330521Z @given( 2025-05-07T20:32:09.3330764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3331076Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3331376Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3331713Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3332066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3332350Z ) 2025-05-07T20:32:09.3332713Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3333158Z def test_silu_mul_quant( 2025-05-07T20:32:09.3333396Z self, 2025-05-07T20:32:09.3333598Z T: int, 2025-05-07T20:32:09.3334058Z D: int, 2025-05-07T20:32:09.3334281Z scale_ub: Optional[float], 2025-05-07T20:32:09.3334548Z contiguous: bool, 2025-05-07T20:32:09.3334794Z compiled: bool, 2025-05-07T20:32:09.3335033Z ) -> None: 2025-05-07T20:32:09.3335244Z torch.manual_seed(2025) 2025-05-07T20:32:09.3335485Z 2025-05-07T20:32:09.3335761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3336097Z 2025-05-07T20:32:09.3336292Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3336587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3336993Z x = x_sign * x_clamp 2025-05-07T20:32:09.3337237Z x0 = x[:, :D] 2025-05-07T20:32:09.3337593Z x1 = x[:, D:] 2025-05-07T20:32:09.3337803Z 2025-05-07T20:32:09.3337993Z if contiguous: 2025-05-07T20:32:09.3338228Z x0 = x0.contiguous() 2025-05-07T20:32:09.3338480Z x1 = x1.contiguous() 2025-05-07T20:32:09.3338720Z 2025-05-07T20:32:09.3338920Z if scale_ub is not None: 2025-05-07T20:32:09.3339193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3339528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3339950Z ) 2025-05-07T20:32:09.3340148Z else: 2025-05-07T20:32:09.3340358Z scale_ub_tensor = None 2025-05-07T20:32:09.3340611Z 2025-05-07T20:32:09.3340845Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3341151Z op = silu_mul_quant 2025-05-07T20:32:09.3341405Z if compiled: 2025-05-07T20:32:09.3341661Z op = torch.compile(op) 2025-05-07T20:32:09.3341952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3342234Z 2025-05-07T20:32:09.3342434Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.3342598Z 2025-05-07T20:32:09.3342698Z moe/activation_test.py:117: 2025-05-07T20:32:09.3343000Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3343337Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.3343621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3344173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.3344731Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.3345388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.3346072Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.3346612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3347295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3347961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3348487Z kernel = self.compile( 2025-05-07T20:32:09.3349032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3349683Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3350072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3350303Z 2025-05-07T20:32:09.3350509Z self = 2025-05-07T20:32:09.3351586Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3352961Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5ddbd0>} 2025-05-07T20:32:09.3354289Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3355367Z context = 2025-05-07T20:32:09.3355659Z 2025-05-07T20:32:09.3355824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3356340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3356804Z module_map=module_map) 2025-05-07T20:32:09.3357205Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3357630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.3357890Z E ^ 2025-05-07T20:32:09.3358373Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3358813Z 2025-05-07T20:32:09.3359225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.3359738Z 2025-05-07T20:32:09.3359847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.3360268Z self=, 2025-05-07T20:32:09.3360664Z T=1, 2025-05-07T20:32:09.3360844Z D=7168, 2025-05-07T20:32:09.3361037Z scale_ub=None, 2025-05-07T20:32:09.3361254Z contiguous=False, 2025-05-07T20:32:09.3361484Z compiled=False, 2025-05-07T20:32:09.3361692Z ) 2025-05-07T20:32:09.3362016Z self = 2025-05-07T20:32:09.3362501Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:09.3362766Z 2025-05-07T20:32:09.3362843Z @given( 2025-05-07T20:32:09.3363076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3363389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3363694Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3364022Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3364350Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3364631Z ) 2025-05-07T20:32:09.3364984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3365422Z def test_silu_mul_quant( 2025-05-07T20:32:09.3365658Z self, 2025-05-07T20:32:09.3365856Z T: int, 2025-05-07T20:32:09.3366056Z D: int, 2025-05-07T20:32:09.3366277Z scale_ub: Optional[float], 2025-05-07T20:32:09.3366552Z contiguous: bool, 2025-05-07T20:32:09.3366798Z compiled: bool, 2025-05-07T20:32:09.3367029Z ) -> None: 2025-05-07T20:32:09.3367248Z torch.manual_seed(2025) 2025-05-07T20:32:09.3367492Z 2025-05-07T20:32:09.3367761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3368100Z 2025-05-07T20:32:09.3368301Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3368600Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3368906Z x = x_sign * x_clamp 2025-05-07T20:32:09.3369151Z x0 = x[:, :D] 2025-05-07T20:32:09.3369376Z x1 = x[:, D:] 2025-05-07T20:32:09.3369579Z 2025-05-07T20:32:09.3369767Z if contiguous: 2025-05-07T20:32:09.3370003Z x0 = x0.contiguous() 2025-05-07T20:32:09.3370255Z x1 = x1.contiguous() 2025-05-07T20:32:09.3370520Z 2025-05-07T20:32:09.3370723Z if scale_ub is not None: 2025-05-07T20:32:09.3371002Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3371333Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3371644Z ) 2025-05-07T20:32:09.3371847Z else: 2025-05-07T20:32:09.3372057Z scale_ub_tensor = None 2025-05-07T20:32:09.3372315Z 2025-05-07T20:32:09.3372552Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3372913Z op = silu_mul_quant 2025-05-07T20:32:09.3373190Z if compiled: 2025-05-07T20:32:09.3373482Z op = torch.compile(op) 2025-05-07T20:32:09.3373777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3374054Z 2025-05-07T20:32:09.3374251Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.3374417Z 2025-05-07T20:32:09.3374521Z moe/activation_test.py:117: 2025-05-07T20:32:09.3374811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3375146Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.3375482Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3376234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.3386543Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.3387165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3387872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3388542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3389084Z kernel = self.compile( 2025-05-07T20:32:09.3389632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3390684Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3391100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3391330Z 2025-05-07T20:32:09.3391561Z self = 2025-05-07T20:32:09.3392632Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3394054Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5de050>} 2025-05-07T20:32:09.3395389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3396408Z context = 2025-05-07T20:32:09.3396698Z 2025-05-07T20:32:09.3396872Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3397414Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3397890Z module_map=module_map) 2025-05-07T20:32:09.3398264Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3398620Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.3398892Z E ^ 2025-05-07T20:32:09.3399366Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3399811Z 2025-05-07T20:32:09.3400227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.3400743Z 2025-05-07T20:32:09.3400853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.3401278Z self=, 2025-05-07T20:32:09.3401692Z T=2048, 2025-05-07T20:32:09.3401888Z D=7168, 2025-05-07T20:32:09.3402095Z scale_ub=None, 2025-05-07T20:32:09.3402321Z contiguous=False, 2025-05-07T20:32:09.3402550Z compiled=True, 2025-05-07T20:32:09.3402766Z ) 2025-05-07T20:32:09.4404626Z self = 2025-05-07T20:32:09.4405478Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:09.4405752Z 2025-05-07T20:32:09.4405842Z @given( 2025-05-07T20:32:09.4406075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.4406392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.4406705Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.4407030Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.4407361Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.4407756Z ) 2025-05-07T20:32:09.4408102Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.4408685Z def test_silu_mul_quant( 2025-05-07T20:32:09.4408935Z self, 2025-05-07T20:32:09.4409130Z T: int, 2025-05-07T20:32:09.4409327Z D: int, 2025-05-07T20:32:09.4409548Z scale_ub: Optional[float], 2025-05-07T20:32:09.4409817Z contiguous: bool, 2025-05-07T20:32:09.4410066Z compiled: bool, 2025-05-07T20:32:09.4410295Z ) -> None: 2025-05-07T20:32:09.4410512Z torch.manual_seed(2025) 2025-05-07T20:32:09.4410759Z 2025-05-07T20:32:09.4411040Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.4411391Z 2025-05-07T20:32:09.4411584Z x_sign = torch.sign(x) 2025-05-07T20:32:09.4411883Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.4412203Z x = x_sign * x_clamp 2025-05-07T20:32:09.4412442Z x0 = x[:, :D] 2025-05-07T20:32:09.4412672Z x1 = x[:, D:] 2025-05-07T20:32:09.4412883Z 2025-05-07T20:32:09.4413069Z if contiguous: 2025-05-07T20:32:09.4413315Z x0 = x0.contiguous() 2025-05-07T20:32:09.4413585Z x1 = x1.contiguous() 2025-05-07T20:32:09.4413825Z 2025-05-07T20:32:09.4414026Z if scale_ub is not None: 2025-05-07T20:32:09.4414305Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.4414638Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.4414954Z ) 2025-05-07T20:32:09.4415156Z else: 2025-05-07T20:32:09.4415371Z scale_ub_tensor = None 2025-05-07T20:32:09.4415631Z 2025-05-07T20:32:09.4415867Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.4416186Z op = silu_mul_quant 2025-05-07T20:32:09.4416437Z if compiled: 2025-05-07T20:32:09.4416684Z op = torch.compile(op) 2025-05-07T20:32:09.4416980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.4417256Z 2025-05-07T20:32:09.4417448Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.4417612Z 2025-05-07T20:32:09.4417719Z moe/activation_test.py:117: 2025-05-07T20:32:09.4418015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.4418347Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.4418629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.4419193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.4419753Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.4420505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.4421203Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.4421741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.4422420Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.4423082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.4423611Z kernel = self.compile( 2025-05-07T20:32:09.4424151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.4424865Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.4425256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.4425489Z 2025-05-07T20:32:09.4425697Z self = 2025-05-07T20:32:09.4426770Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.4428276Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5df1c0>} 2025-05-07T20:32:09.4429614Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.4430634Z context = 2025-05-07T20:32:09.4430925Z 2025-05-07T20:32:09.4431100Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.4431627Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.4432085Z module_map=module_map) 2025-05-07T20:32:09.4432459Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.4432824Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.4433081Z E ^ 2025-05-07T20:32:09.4433596Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.4434053Z 2025-05-07T20:32:09.4434467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.4434984Z 2025-05-07T20:32:09.4435099Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.4435507Z self=, 2025-05-07T20:32:09.4435908Z T=4096, 2025-05-07T20:32:09.4436099Z D=7168, 2025-05-07T20:32:09.4436290Z scale_ub=None, 2025-05-07T20:32:09.4436528Z contiguous=False, 2025-05-07T20:32:09.4436757Z compiled=True, 2025-05-07T20:32:09.4436962Z ) 2025-05-07T20:32:09.4437274Z self = 2025-05-07T20:32:09.4437767Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:09.4438044Z 2025-05-07T20:32:09.4438124Z @given( 2025-05-07T20:32:09.4438363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.4438673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.4438981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.4439308Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.4439633Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.4439923Z ) 2025-05-07T20:32:09.4440277Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.4440714Z def test_silu_mul_quant( 2025-05-07T20:32:09.4440953Z self, 2025-05-07T20:32:09.4441149Z T: int, 2025-05-07T20:32:09.4441352Z D: int, 2025-05-07T20:32:09.4441567Z scale_ub: Optional[float], 2025-05-07T20:32:09.4441841Z contiguous: bool, 2025-05-07T20:32:09.4442081Z compiled: bool, 2025-05-07T20:32:09.4442308Z ) -> None: 2025-05-07T20:32:09.4442528Z torch.manual_seed(2025) 2025-05-07T20:32:09.4442775Z 2025-05-07T20:32:09.4443050Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.4443434Z 2025-05-07T20:32:09.4443643Z x_sign = torch.sign(x) 2025-05-07T20:32:09.4443931Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.4444299Z x = x_sign * x_clamp 2025-05-07T20:32:09.4444544Z x0 = x[:, :D] 2025-05-07T20:32:09.4444758Z x1 = x[:, D:] 2025-05-07T20:32:09.4444970Z 2025-05-07T20:32:09.4445158Z if contiguous: 2025-05-07T20:32:09.4445393Z x0 = x0.contiguous() 2025-05-07T20:32:09.4445658Z x1 = x1.contiguous() 2025-05-07T20:32:09.4445898Z 2025-05-07T20:32:09.4446094Z if scale_ub is not None: 2025-05-07T20:32:09.4446371Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.4446700Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.4447059Z ) 2025-05-07T20:32:09.4447249Z else: 2025-05-07T20:32:09.4447538Z scale_ub_tensor = None 2025-05-07T20:32:09.4447796Z 2025-05-07T20:32:09.4448022Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.4448340Z op = silu_mul_quant 2025-05-07T20:32:09.4448597Z if compiled: 2025-05-07T20:32:09.4448846Z op = torch.compile(op) 2025-05-07T20:32:09.4449143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.4449416Z 2025-05-07T20:32:09.4449605Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.4449773Z 2025-05-07T20:32:09.4449877Z moe/activation_test.py:117: 2025-05-07T20:32:09.4450173Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.4450503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.4450780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.4451335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.4451895Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.4452552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.4453240Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.4453772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.4454448Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.4455100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.4455629Z kernel = self.compile( 2025-05-07T20:32:09.4456164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.4456811Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.4457212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.4457443Z 2025-05-07T20:32:09.4457648Z self = 2025-05-07T20:32:09.4458714Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.4460167Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf301f0>} 2025-05-07T20:32:09.4461504Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.4462522Z context = 2025-05-07T20:32:09.4462808Z 2025-05-07T20:32:09.4462986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.4463510Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.4463968Z module_map=module_map) 2025-05-07T20:32:09.4464394Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.4464753Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.4465005Z E ^ 2025-05-07T20:32:09.4465473Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.4465921Z 2025-05-07T20:32:09.4466346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.4466852Z 2025-05-07T20:32:09.8116554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.8117543Z self=, 2025-05-07T20:32:09.8118317Z T=16384, 2025-05-07T20:32:09.8118572Z D=5120, 2025-05-07T20:32:09.8118813Z scale_ub=1200.0, 2025-05-07T20:32:09.8119044Z contiguous=False, 2025-05-07T20:32:09.8119277Z compiled=False, 2025-05-07T20:32:09.8119488Z ) 2025-05-07T20:32:09.8119813Z self = 2025-05-07T20:32:09.8120330Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:09.8120613Z 2025-05-07T20:32:09.8120696Z @given( 2025-05-07T20:32:09.8120936Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.8121252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.8121566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.8121896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.8122233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.8122531Z ) 2025-05-07T20:32:09.8122890Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.8123337Z def test_silu_mul_quant( 2025-05-07T20:32:09.8123590Z self, 2025-05-07T20:32:09.8123788Z T: int, 2025-05-07T20:32:09.8123991Z D: int, 2025-05-07T20:32:09.8124214Z scale_ub: Optional[float], 2025-05-07T20:32:09.8124488Z contiguous: bool, 2025-05-07T20:32:09.8124732Z compiled: bool, 2025-05-07T20:32:09.8124969Z ) -> None: 2025-05-07T20:32:09.8125186Z torch.manual_seed(2025) 2025-05-07T20:32:09.8125433Z 2025-05-07T20:32:09.8125714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.8126064Z 2025-05-07T20:32:09.8126258Z x_sign = torch.sign(x) 2025-05-07T20:32:09.8126556Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.8126874Z x = x_sign * x_clamp 2025-05-07T20:32:09.8127122Z x0 = x[:, :D] 2025-05-07T20:32:09.8127347Z x1 = x[:, D:] 2025-05-07T20:32:09.8127563Z 2025-05-07T20:32:09.8127758Z if contiguous: 2025-05-07T20:32:09.8128003Z x0 = x0.contiguous() 2025-05-07T20:32:09.8128273Z x1 = x1.contiguous() 2025-05-07T20:32:09.8128518Z 2025-05-07T20:32:09.8128718Z if scale_ub is not None: 2025-05-07T20:32:09.8129002Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.8129341Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.8129648Z ) 2025-05-07T20:32:09.8129845Z else: 2025-05-07T20:32:09.8130061Z scale_ub_tensor = None 2025-05-07T20:32:09.8130321Z 2025-05-07T20:32:09.8130557Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8130876Z op = silu_mul_quant 2025-05-07T20:32:09.8131133Z if compiled: 2025-05-07T20:32:09.8131385Z op = torch.compile(op) 2025-05-07T20:32:09.8131691Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8131972Z 2025-05-07T20:32:09.8132182Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.8132347Z 2025-05-07T20:32:09.8132450Z moe/activation_test.py:117: 2025-05-07T20:32:09.8132752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8133092Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.8133460Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8134151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:09.8134842Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.8135375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.8136059Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.8136769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.8137385Z kernel = self.compile( 2025-05-07T20:32:09.8137929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.8138585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.8138985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8139214Z 2025-05-07T20:32:09.8139428Z self = 2025-05-07T20:32:09.8140586Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.8141978Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf30700>} 2025-05-07T20:32:09.8143321Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.8144344Z context = 2025-05-07T20:32:09.8144635Z 2025-05-07T20:32:09.8144807Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.8145320Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.8145787Z module_map=module_map) 2025-05-07T20:32:09.8146156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.8146504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.8146770Z E ^ 2025-05-07T20:32:09.8147235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.8147680Z 2025-05-07T20:32:09.8148109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.8148617Z 2025-05-07T20:32:09.8148724Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.8149139Z self=, 2025-05-07T20:32:09.8149544Z T=16384, 2025-05-07T20:32:09.8149738Z D=5120, 2025-05-07T20:32:09.8149939Z scale_ub=1200.0, 2025-05-07T20:32:09.8150173Z contiguous=True, 2025-05-07T20:32:09.8150394Z compiled=True, 2025-05-07T20:32:09.8150603Z ) 2025-05-07T20:32:09.8150925Z self = 2025-05-07T20:32:09.8151414Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:09.8151694Z 2025-05-07T20:32:09.8151771Z @given( 2025-05-07T20:32:09.8152011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.8152329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.8152640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.8152981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.8153310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.8153593Z ) 2025-05-07T20:32:09.8154006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.8154449Z def test_silu_mul_quant( 2025-05-07T20:32:09.8154692Z self, 2025-05-07T20:32:09.8154891Z T: int, 2025-05-07T20:32:09.8155094Z D: int, 2025-05-07T20:32:09.8155314Z scale_ub: Optional[float], 2025-05-07T20:32:09.8155592Z contiguous: bool, 2025-05-07T20:32:09.8155843Z compiled: bool, 2025-05-07T20:32:09.8156072Z ) -> None: 2025-05-07T20:32:09.8156291Z torch.manual_seed(2025) 2025-05-07T20:32:09.8156588Z 2025-05-07T20:32:09.8156867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.8157205Z 2025-05-07T20:32:09.8157483Z x_sign = torch.sign(x) 2025-05-07T20:32:09.8157783Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.8158092Z x = x_sign * x_clamp 2025-05-07T20:32:09.8158341Z x0 = x[:, :D] 2025-05-07T20:32:09.8158566Z x1 = x[:, D:] 2025-05-07T20:32:09.8158772Z 2025-05-07T20:32:09.8158961Z if contiguous: 2025-05-07T20:32:09.8159199Z x0 = x0.contiguous() 2025-05-07T20:32:09.8159459Z x1 = x1.contiguous() 2025-05-07T20:32:09.8159704Z 2025-05-07T20:32:09.8159903Z if scale_ub is not None: 2025-05-07T20:32:09.8160173Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.8160514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.8160826Z ) 2025-05-07T20:32:09.8161022Z else: 2025-05-07T20:32:09.8161238Z scale_ub_tensor = None 2025-05-07T20:32:09.8161495Z 2025-05-07T20:32:09.8161737Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8162047Z op = silu_mul_quant 2025-05-07T20:32:09.8162305Z if compiled: 2025-05-07T20:32:09.8162559Z op = torch.compile(op) 2025-05-07T20:32:09.8162854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8163136Z 2025-05-07T20:32:09.8163356Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.8163544Z 2025-05-07T20:32:09.8163647Z moe/activation_test.py:117: 2025-05-07T20:32:09.8163945Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8164278Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.8164564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8165121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.8165685Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.8166348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.8167030Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.8167574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.8168256Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.8168919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.8169446Z kernel = self.compile( 2025-05-07T20:32:09.8169990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.8170662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.8171052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8171290Z 2025-05-07T20:32:09.8171503Z self = 2025-05-07T20:32:09.8172574Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.8174005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf317e0>} 2025-05-07T20:32:09.8175346Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.8176359Z context = 2025-05-07T20:32:09.8176699Z 2025-05-07T20:32:09.8176872Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.8177472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.8177950Z module_map=module_map) 2025-05-07T20:32:09.8178315Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.8178677Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.8178960Z E ^ 2025-05-07T20:32:09.8179438Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.8179937Z 2025-05-07T20:32:09.8180352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.8180869Z 2025-05-07T20:32:10.0085320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.0086138Z self=, 2025-05-07T20:32:10.0086849Z T=16384, 2025-05-07T20:32:10.0087149Z D=5120, 2025-05-07T20:32:10.0087442Z scale_ub=None, 2025-05-07T20:32:10.0087779Z contiguous=False, 2025-05-07T20:32:10.0088124Z compiled=True, 2025-05-07T20:32:10.0088426Z ) 2025-05-07T20:32:10.0088908Z self = 2025-05-07T20:32:10.0089653Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:10.0090447Z 2025-05-07T20:32:10.0090563Z @given( 2025-05-07T20:32:10.0090922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.0091394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.0091848Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.0101825Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.0102225Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.0102534Z ) 2025-05-07T20:32:10.0102905Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.0103364Z def test_silu_mul_quant( 2025-05-07T20:32:10.0103627Z self, 2025-05-07T20:32:10.0103844Z T: int, 2025-05-07T20:32:10.0104057Z D: int, 2025-05-07T20:32:10.0104286Z scale_ub: Optional[float], 2025-05-07T20:32:10.0104575Z contiguous: bool, 2025-05-07T20:32:10.0104833Z compiled: bool, 2025-05-07T20:32:10.0105073Z ) -> None: 2025-05-07T20:32:10.0105308Z torch.manual_seed(2025) 2025-05-07T20:32:10.0105567Z 2025-05-07T20:32:10.0105853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.0106212Z 2025-05-07T20:32:10.0106421Z x_sign = torch.sign(x) 2025-05-07T20:32:10.0106727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.0107054Z x = x_sign * x_clamp 2025-05-07T20:32:10.0107319Z x0 = x[:, :D] 2025-05-07T20:32:10.0107545Z x1 = x[:, D:] 2025-05-07T20:32:10.0107774Z 2025-05-07T20:32:10.0107975Z if contiguous: 2025-05-07T20:32:10.0108218Z x0 = x0.contiguous() 2025-05-07T20:32:10.0108500Z x1 = x1.contiguous() 2025-05-07T20:32:10.0108756Z 2025-05-07T20:32:10.0108956Z if scale_ub is not None: 2025-05-07T20:32:10.0109249Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.0109600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.0110184Z ) 2025-05-07T20:32:10.0110381Z else: 2025-05-07T20:32:10.0110599Z scale_ub_tensor = None 2025-05-07T20:32:10.0110861Z 2025-05-07T20:32:10.0111102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.0111433Z op = silu_mul_quant 2025-05-07T20:32:10.0111705Z if compiled: 2025-05-07T20:32:10.0111964Z op = torch.compile(op) 2025-05-07T20:32:10.0112277Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.0112567Z 2025-05-07T20:32:10.0112869Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.0113054Z 2025-05-07T20:32:10.0113163Z moe/activation_test.py:117: 2025-05-07T20:32:10.0113615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.0113963Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.0114261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.0114836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.0115415Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.0116078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.0116783Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.0117337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.0118033Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.0118715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.0119263Z kernel = self.compile( 2025-05-07T20:32:10.0119820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.0120488Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.0120909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.0121155Z 2025-05-07T20:32:10.0121371Z self = 2025-05-07T20:32:10.0122467Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.0123876Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf32680>} 2025-05-07T20:32:10.0125229Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.0126266Z context = 2025-05-07T20:32:10.0126558Z 2025-05-07T20:32:10.0126742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.0127279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.0127756Z module_map=module_map) 2025-05-07T20:32:10.0128139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.0128512Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.0128782Z E ^ 2025-05-07T20:32:10.0129263Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.0129717Z 2025-05-07T20:32:10.0130149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.0130664Z 2025-05-07T20:32:10.0130785Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.0131262Z self=, 2025-05-07T20:32:10.0131671Z T=2048, 2025-05-07T20:32:10.0131865Z D=5120, 2025-05-07T20:32:10.0132068Z scale_ub=None, 2025-05-07T20:32:10.0132292Z contiguous=False, 2025-05-07T20:32:10.0132519Z compiled=True, 2025-05-07T20:32:10.0132738Z ) 2025-05-07T20:32:10.1168197Z self = 2025-05-07T20:32:10.1168816Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:10.1169324Z 2025-05-07T20:32:10.1169411Z @given( 2025-05-07T20:32:10.1169652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.1170125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.1170448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.1170781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.1171119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.1171423Z ) 2025-05-07T20:32:10.1171784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.1172228Z def test_silu_mul_quant( 2025-05-07T20:32:10.1172485Z self, 2025-05-07T20:32:10.1172694Z T: int, 2025-05-07T20:32:10.1172903Z D: int, 2025-05-07T20:32:10.1173138Z scale_ub: Optional[float], 2025-05-07T20:32:10.1173421Z contiguous: bool, 2025-05-07T20:32:10.1173667Z compiled: bool, 2025-05-07T20:32:10.1173910Z ) -> None: 2025-05-07T20:32:10.1174148Z torch.manual_seed(2025) 2025-05-07T20:32:10.1174394Z 2025-05-07T20:32:10.1174685Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.1175035Z 2025-05-07T20:32:10.1175234Z x_sign = torch.sign(x) 2025-05-07T20:32:10.1175538Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.1175862Z x = x_sign * x_clamp 2025-05-07T20:32:10.1176115Z x0 = x[:, :D] 2025-05-07T20:32:10.1176346Z x1 = x[:, D:] 2025-05-07T20:32:10.1176567Z 2025-05-07T20:32:10.1176759Z if contiguous: 2025-05-07T20:32:10.1177003Z x0 = x0.contiguous() 2025-05-07T20:32:10.1177282Z x1 = x1.contiguous() 2025-05-07T20:32:10.1177533Z 2025-05-07T20:32:10.1177732Z if scale_ub is not None: 2025-05-07T20:32:10.1178014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.1178359Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.1178677Z ) 2025-05-07T20:32:10.1178882Z else: 2025-05-07T20:32:10.1179104Z scale_ub_tensor = None 2025-05-07T20:32:10.1179361Z 2025-05-07T20:32:10.1179608Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.1180020Z op = silu_mul_quant 2025-05-07T20:32:10.1180276Z if compiled: 2025-05-07T20:32:10.1180535Z op = torch.compile(op) 2025-05-07T20:32:10.1180846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.1181125Z 2025-05-07T20:32:10.1181332Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.1181503Z 2025-05-07T20:32:10.1181614Z moe/activation_test.py:117: 2025-05-07T20:32:10.1181923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.1182259Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.1182552Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.1183123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.1183689Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.1184360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.1185060Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.1185611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.1186382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.1187059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.1187598Z kernel = self.compile( 2025-05-07T20:32:10.1188142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.1188804Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.1189262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.1189493Z 2025-05-07T20:32:10.1189795Z self = 2025-05-07T20:32:10.1191132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.1192521Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf32560>} 2025-05-07T20:32:10.1193872Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.1194905Z context = 2025-05-07T20:32:10.1195199Z 2025-05-07T20:32:10.1195386Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.1195909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.1196384Z module_map=module_map) 2025-05-07T20:32:10.1196768Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.1197131Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.1197407Z E ^ 2025-05-07T20:32:10.1197881Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.1198330Z 2025-05-07T20:32:10.1198764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.1199279Z 2025-05-07T20:32:10.1199389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.1199812Z self=, 2025-05-07T20:32:10.1200222Z T=2048, 2025-05-07T20:32:10.1200412Z D=5120, 2025-05-07T20:32:10.1200620Z scale_ub=1200.0, 2025-05-07T20:32:10.1200857Z contiguous=False, 2025-05-07T20:32:10.1201085Z compiled=True, 2025-05-07T20:32:10.1201302Z ) 2025-05-07T20:32:10.1201633Z self = 2025-05-07T20:32:10.1202144Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:10.1202417Z 2025-05-07T20:32:10.1202497Z @given( 2025-05-07T20:32:10.1202738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.1203065Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.1203373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.1203711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.1204046Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.1204337Z ) 2025-05-07T20:32:10.1204694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.1205147Z def test_silu_mul_quant( 2025-05-07T20:32:10.1205397Z self, 2025-05-07T20:32:10.1205595Z T: int, 2025-05-07T20:32:10.1205803Z D: int, 2025-05-07T20:32:10.1206032Z scale_ub: Optional[float], 2025-05-07T20:32:10.1206311Z contiguous: bool, 2025-05-07T20:32:10.1206662Z compiled: bool, 2025-05-07T20:32:10.1206894Z ) -> None: 2025-05-07T20:32:10.1207113Z torch.manual_seed(2025) 2025-05-07T20:32:10.1207365Z 2025-05-07T20:32:10.1207648Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.1207988Z 2025-05-07T20:32:10.1208189Z x_sign = torch.sign(x) 2025-05-07T20:32:10.1208494Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.1208812Z x = x_sign * x_clamp 2025-05-07T20:32:10.1209064Z x0 = x[:, :D] 2025-05-07T20:32:10.1209364Z x1 = x[:, D:] 2025-05-07T20:32:10.1209577Z 2025-05-07T20:32:10.1209775Z if contiguous: 2025-05-07T20:32:10.1210200Z x0 = x0.contiguous() 2025-05-07T20:32:10.1210470Z x1 = x1.contiguous() 2025-05-07T20:32:10.1210725Z 2025-05-07T20:32:10.1210930Z if scale_ub is not None: 2025-05-07T20:32:10.1211205Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.1211560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.1211875Z ) 2025-05-07T20:32:10.1212083Z else: 2025-05-07T20:32:10.1212304Z scale_ub_tensor = None 2025-05-07T20:32:10.1212569Z 2025-05-07T20:32:10.1212817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.1213133Z op = silu_mul_quant 2025-05-07T20:32:10.1213396Z if compiled: 2025-05-07T20:32:10.1213655Z op = torch.compile(op) 2025-05-07T20:32:10.1213962Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.1214251Z 2025-05-07T20:32:10.1214454Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.1214624Z 2025-05-07T20:32:10.1214735Z moe/activation_test.py:117: 2025-05-07T20:32:10.1215040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.1215378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.1215667Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.1216230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.1216797Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.1217464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.1218153Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.1218695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.1219392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.1220146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.1220679Z kernel = self.compile( 2025-05-07T20:32:10.1221226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.1221885Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.1222281Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.1222520Z 2025-05-07T20:32:10.1222730Z self = 2025-05-07T20:32:10.1223864Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.1225241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf33370>} 2025-05-07T20:32:10.1226590Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.1227676Z context = 2025-05-07T20:32:10.1227974Z 2025-05-07T20:32:10.1228144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.1228682Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.1229160Z module_map=module_map) 2025-05-07T20:32:10.1229533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.1229943Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.1230213Z E ^ 2025-05-07T20:32:10.1230764Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.1231222Z 2025-05-07T20:32:10.1231640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.1232161Z 2025-05-07T20:32:10.3143917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.3144501Z self=, 2025-05-07T20:32:10.3144913Z T=4096, 2025-05-07T20:32:10.3145104Z D=5120, 2025-05-07T20:32:10.3145304Z scale_ub=1200.0, 2025-05-07T20:32:10.3145533Z contiguous=True, 2025-05-07T20:32:10.3145754Z compiled=True, 2025-05-07T20:32:10.3145975Z ) 2025-05-07T20:32:10.3146298Z self = 2025-05-07T20:32:10.3146788Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:10.3147085Z 2025-05-07T20:32:10.3147167Z @given( 2025-05-07T20:32:10.3147418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.3147731Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.3148041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.3148374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.3148711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.3148991Z ) 2025-05-07T20:32:10.3149344Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.3149787Z def test_silu_mul_quant( 2025-05-07T20:32:10.3150028Z self, 2025-05-07T20:32:10.3150230Z T: int, 2025-05-07T20:32:10.3150432Z D: int, 2025-05-07T20:32:10.3150649Z scale_ub: Optional[float], 2025-05-07T20:32:10.3150924Z contiguous: bool, 2025-05-07T20:32:10.3151167Z compiled: bool, 2025-05-07T20:32:10.3151395Z ) -> None: 2025-05-07T20:32:10.3151614Z torch.manual_seed(2025) 2025-05-07T20:32:10.3151860Z 2025-05-07T20:32:10.3152134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.3152480Z 2025-05-07T20:32:10.3152680Z x_sign = torch.sign(x) 2025-05-07T20:32:10.3152967Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.3153282Z x = x_sign * x_clamp 2025-05-07T20:32:10.3153548Z x0 = x[:, :D] 2025-05-07T20:32:10.3153793Z x1 = x[:, D:] 2025-05-07T20:32:10.3153996Z 2025-05-07T20:32:10.3154187Z if contiguous: 2025-05-07T20:32:10.3154425Z x0 = x0.contiguous() 2025-05-07T20:32:10.3154682Z x1 = x1.contiguous() 2025-05-07T20:32:10.3154924Z 2025-05-07T20:32:10.3155121Z if scale_ub is not None: 2025-05-07T20:32:10.3155386Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.3155722Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.3156034Z ) 2025-05-07T20:32:10.3156228Z else: 2025-05-07T20:32:10.3156452Z scale_ub_tensor = None 2025-05-07T20:32:10.3156709Z 2025-05-07T20:32:10.3156936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.3157251Z op = silu_mul_quant 2025-05-07T20:32:10.3157504Z if compiled: 2025-05-07T20:32:10.3158044Z op = torch.compile(op) 2025-05-07T20:32:10.3158344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.3158618Z 2025-05-07T20:32:10.3158815Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.3158980Z 2025-05-07T20:32:10.3159082Z moe/activation_test.py:117: 2025-05-07T20:32:10.3159381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3159720Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.3160002Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.3160661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.3161356Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.3162012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.3162701Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.3163241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.3163921Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.3164573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.3165107Z kernel = self.compile( 2025-05-07T20:32:10.3165649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.3166307Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.3166706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3166938Z 2025-05-07T20:32:10.3167147Z self = 2025-05-07T20:32:10.3168224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.3169606Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01c310>} 2025-05-07T20:32:10.3170935Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.3171968Z context = 2025-05-07T20:32:10.3172259Z 2025-05-07T20:32:10.3172432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.3172953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.3173410Z module_map=module_map) 2025-05-07T20:32:10.3173780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.3174134Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.3174384Z E ^ 2025-05-07T20:32:10.3174846Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.3175297Z 2025-05-07T20:32:10.3175711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.3176219Z 2025-05-07T20:32:10.3176331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.3176735Z self=, 2025-05-07T20:32:10.3177137Z T=128, 2025-05-07T20:32:10.3177328Z D=5120, 2025-05-07T20:32:10.3177525Z scale_ub=1200.0, 2025-05-07T20:32:10.3177743Z contiguous=False, 2025-05-07T20:32:10.3177969Z compiled=True, 2025-05-07T20:32:10.3178171Z ) 2025-05-07T20:32:10.6252448Z self = 2025-05-07T20:32:10.6253212Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:10.6253665Z 2025-05-07T20:32:10.6253840Z @given( 2025-05-07T20:32:10.6254241Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.6254712Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.6255167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.6255655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.6256510Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.6256938Z ) 2025-05-07T20:32:10.6257651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.6258328Z def test_silu_mul_quant( 2025-05-07T20:32:10.6258690Z self, 2025-05-07T20:32:10.6258982Z T: int, 2025-05-07T20:32:10.6259270Z D: int, 2025-05-07T20:32:10.6259599Z scale_ub: Optional[float], 2025-05-07T20:32:10.6260134Z contiguous: bool, 2025-05-07T20:32:10.6260493Z compiled: bool, 2025-05-07T20:32:10.6260841Z ) -> None: 2025-05-07T20:32:10.6261165Z torch.manual_seed(2025) 2025-05-07T20:32:10.6261521Z 2025-05-07T20:32:10.6261934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.6262466Z 2025-05-07T20:32:10.6262757Z x_sign = torch.sign(x) 2025-05-07T20:32:10.6263198Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.6263580Z x = x_sign * x_clamp 2025-05-07T20:32:10.6263830Z x0 = x[:, :D] 2025-05-07T20:32:10.6264058Z x1 = x[:, D:] 2025-05-07T20:32:10.6264265Z 2025-05-07T20:32:10.6264470Z if contiguous: 2025-05-07T20:32:10.6264712Z x0 = x0.contiguous() 2025-05-07T20:32:10.6264970Z x1 = x1.contiguous() 2025-05-07T20:32:10.6265211Z 2025-05-07T20:32:10.6265413Z if scale_ub is not None: 2025-05-07T20:32:10.6265693Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.6266033Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.6266343Z ) 2025-05-07T20:32:10.6266536Z else: 2025-05-07T20:32:10.6266756Z scale_ub_tensor = None 2025-05-07T20:32:10.6267014Z 2025-05-07T20:32:10.6267246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.6277821Z op = silu_mul_quant 2025-05-07T20:32:10.6278110Z if compiled: 2025-05-07T20:32:10.6278378Z op = torch.compile(op) 2025-05-07T20:32:10.6278702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6278984Z 2025-05-07T20:32:10.6279202Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.6279372Z 2025-05-07T20:32:10.6279486Z moe/activation_test.py:117: 2025-05-07T20:32:10.6279793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6280142Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.6280442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6281010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.6281586Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.6282261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.6282965Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.6283506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.6284205Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.6284878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.6285423Z kernel = self.compile( 2025-05-07T20:32:10.6285972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.6286762Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.6287172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6287405Z 2025-05-07T20:32:10.6287617Z self = 2025-05-07T20:32:10.6288708Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.6290619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01d090>} 2025-05-07T20:32:10.6291970Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.6292999Z context = 2025-05-07T20:32:10.6293287Z 2025-05-07T20:32:10.6293456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.6293992Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.6294467Z module_map=module_map) 2025-05-07T20:32:10.6294843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.6295205Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.6295473Z E ^ 2025-05-07T20:32:10.6295954Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.6296404Z 2025-05-07T20:32:10.6296824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.6297346Z 2025-05-07T20:32:10.6297456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.6297877Z self=, 2025-05-07T20:32:10.6298282Z T=16384, 2025-05-07T20:32:10.6298474Z D=7168, 2025-05-07T20:32:10.6298678Z scale_ub=1200.0, 2025-05-07T20:32:10.6298910Z contiguous=True, 2025-05-07T20:32:10.6299133Z compiled=True, 2025-05-07T20:32:10.6299345Z ) 2025-05-07T20:32:10.6299667Z self = 2025-05-07T20:32:10.6300263Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:10.6300552Z 2025-05-07T20:32:10.6300630Z @given( 2025-05-07T20:32:10.6300870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.6301186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.6301501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.6301849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.6302189Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.6302478Z ) 2025-05-07T20:32:10.6302839Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.6303293Z def test_silu_mul_quant( 2025-05-07T20:32:10.6303548Z self, 2025-05-07T20:32:10.6303792Z T: int, 2025-05-07T20:32:10.6304012Z D: int, 2025-05-07T20:32:10.6304237Z scale_ub: Optional[float], 2025-05-07T20:32:10.6304524Z contiguous: bool, 2025-05-07T20:32:10.6304776Z compiled: bool, 2025-05-07T20:32:10.6305003Z ) -> None: 2025-05-07T20:32:10.6305239Z torch.manual_seed(2025) 2025-05-07T20:32:10.6305493Z 2025-05-07T20:32:10.6305769Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.6306118Z 2025-05-07T20:32:10.6306325Z x_sign = torch.sign(x) 2025-05-07T20:32:10.6306698Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.6307020Z x = x_sign * x_clamp 2025-05-07T20:32:10.6307274Z x0 = x[:, :D] 2025-05-07T20:32:10.6307500Z x1 = x[:, D:] 2025-05-07T20:32:10.6307711Z 2025-05-07T20:32:10.6307894Z if contiguous: 2025-05-07T20:32:10.6308137Z x0 = x0.contiguous() 2025-05-07T20:32:10.6308401Z x1 = x1.contiguous() 2025-05-07T20:32:10.6308636Z 2025-05-07T20:32:10.6308835Z if scale_ub is not None: 2025-05-07T20:32:10.6309116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.6309531Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.6309956Z ) 2025-05-07T20:32:10.6310162Z else: 2025-05-07T20:32:10.6310382Z scale_ub_tensor = None 2025-05-07T20:32:10.6310634Z 2025-05-07T20:32:10.6310874Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.6311197Z op = silu_mul_quant 2025-05-07T20:32:10.6311455Z if compiled: 2025-05-07T20:32:10.6311713Z op = torch.compile(op) 2025-05-07T20:32:10.6312015Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6312289Z 2025-05-07T20:32:10.6312491Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.6312658Z 2025-05-07T20:32:10.6312765Z moe/activation_test.py:117: 2025-05-07T20:32:10.6313065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6313406Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.6313702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6314268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.6314825Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.6315485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.6316174Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.6316709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.6317398Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.6318063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.6318597Z kernel = self.compile( 2025-05-07T20:32:10.6319135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.6319800Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.6320204Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6320433Z 2025-05-07T20:32:10.6320647Z self = 2025-05-07T20:32:10.6321713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.6323079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01e290>} 2025-05-07T20:32:10.6324418Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.6325447Z context = 2025-05-07T20:32:10.6325735Z 2025-05-07T20:32:10.6325910Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.6326426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.6326942Z module_map=module_map) 2025-05-07T20:32:10.6327312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.6327662Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.6327925Z E ^ 2025-05-07T20:32:10.6328396Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.6328841Z 2025-05-07T20:32:10.6329265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.6329829Z 2025-05-07T20:32:10.7673381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.7674450Z self=, 2025-05-07T20:32:10.7674956Z T=16384, 2025-05-07T20:32:10.7675162Z D=5120, 2025-05-07T20:32:10.7675365Z scale_ub=1200.0, 2025-05-07T20:32:10.7675595Z contiguous=True, 2025-05-07T20:32:10.7675823Z compiled=False, 2025-05-07T20:32:10.7676056Z ) 2025-05-07T20:32:10.7676379Z self = 2025-05-07T20:32:10.7676885Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:10.7677167Z 2025-05-07T20:32:10.7677256Z @given( 2025-05-07T20:32:10.7677485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.7677908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.7678345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.7678723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.7679043Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.7679334Z ) 2025-05-07T20:32:10.7679686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.7680121Z def test_silu_mul_quant( 2025-05-07T20:32:10.7680368Z self, 2025-05-07T20:32:10.7680563Z T: int, 2025-05-07T20:32:10.7680759Z D: int, 2025-05-07T20:32:10.7680977Z scale_ub: Optional[float], 2025-05-07T20:32:10.7681246Z contiguous: bool, 2025-05-07T20:32:10.7681481Z compiled: bool, 2025-05-07T20:32:10.7681705Z ) -> None: 2025-05-07T20:32:10.7681923Z torch.manual_seed(2025) 2025-05-07T20:32:10.7682160Z 2025-05-07T20:32:10.7682432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.7682775Z 2025-05-07T20:32:10.7682968Z x_sign = torch.sign(x) 2025-05-07T20:32:10.7683255Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.7683570Z x = x_sign * x_clamp 2025-05-07T20:32:10.7683814Z x0 = x[:, :D] 2025-05-07T20:32:10.7684032Z x1 = x[:, D:] 2025-05-07T20:32:10.7684243Z 2025-05-07T20:32:10.7684429Z if contiguous: 2025-05-07T20:32:10.7684656Z x0 = x0.contiguous() 2025-05-07T20:32:10.7684914Z x1 = x1.contiguous() 2025-05-07T20:32:10.7685153Z 2025-05-07T20:32:10.7685339Z if scale_ub is not None: 2025-05-07T20:32:10.7685612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.7685947Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.7686249Z ) 2025-05-07T20:32:10.7686443Z else: 2025-05-07T20:32:10.7686659Z scale_ub_tensor = None 2025-05-07T20:32:10.7686905Z 2025-05-07T20:32:10.7687138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.7687455Z op = silu_mul_quant 2025-05-07T20:32:10.7687712Z if compiled: 2025-05-07T20:32:10.7687958Z op = torch.compile(op) 2025-05-07T20:32:10.7688264Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7688540Z 2025-05-07T20:32:10.7688728Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.7688901Z 2025-05-07T20:32:10.7689001Z moe/activation_test.py:117: 2025-05-07T20:32:10.7689296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7689734Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.7690317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7691009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:10.7691694Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.7692222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.7692903Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.7693762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.7694296Z kernel = self.compile( 2025-05-07T20:32:10.7694837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.7695494Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.7695893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7696118Z 2025-05-07T20:32:10.7696325Z self = 2025-05-07T20:32:10.7697397Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.7698778Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01d1b0>} 2025-05-07T20:32:10.7700223Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.7701248Z context = 2025-05-07T20:32:10.7701536Z 2025-05-07T20:32:10.7701701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.7702221Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.7702690Z module_map=module_map) 2025-05-07T20:32:10.7703055Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.7703407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.7703674Z E ^ 2025-05-07T20:32:10.7704147Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.7704593Z 2025-05-07T20:32:10.7705009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.7705524Z 2025-05-07T20:32:10.7705628Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.7706049Z self=, 2025-05-07T20:32:10.7706447Z T=1, 2025-05-07T20:32:10.7706628Z D=7168, 2025-05-07T20:32:10.7706826Z scale_ub=1200.0, 2025-05-07T20:32:10.7707054Z contiguous=False, 2025-05-07T20:32:10.7707277Z compiled=False, 2025-05-07T20:32:10.7707488Z ) 2025-05-07T20:32:10.7707806Z self = 2025-05-07T20:32:10.7708291Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:10.7708564Z 2025-05-07T20:32:10.7708640Z @given( 2025-05-07T20:32:10.7708933Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.7709366Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.7709726Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.7710322Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.7710731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.7711163Z ) 2025-05-07T20:32:10.7711725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.7712277Z def test_silu_mul_quant( 2025-05-07T20:32:10.7712559Z self, 2025-05-07T20:32:10.7712927Z T: int, 2025-05-07T20:32:10.7713211Z D: int, 2025-05-07T20:32:10.7713472Z scale_ub: Optional[float], 2025-05-07T20:32:10.7713916Z contiguous: bool, 2025-05-07T20:32:10.7714242Z compiled: bool, 2025-05-07T20:32:10.7714508Z ) -> None: 2025-05-07T20:32:10.7714945Z torch.manual_seed(2025) 2025-05-07T20:32:10.7715274Z 2025-05-07T20:32:10.7715705Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.7716216Z 2025-05-07T20:32:10.7716498Z x_sign = torch.sign(x) 2025-05-07T20:32:10.7716940Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.7717322Z x = x_sign * x_clamp 2025-05-07T20:32:10.7717653Z x0 = x[:, :D] 2025-05-07T20:32:10.7718017Z x1 = x[:, D:] 2025-05-07T20:32:10.7718292Z 2025-05-07T20:32:10.7718565Z if contiguous: 2025-05-07T20:32:10.7718947Z x0 = x0.contiguous() 2025-05-07T20:32:10.7719275Z x1 = x1.contiguous() 2025-05-07T20:32:10.7719667Z 2025-05-07T20:32:10.7719985Z if scale_ub is not None: 2025-05-07T20:32:10.7720332Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.7720775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.7721206Z ) 2025-05-07T20:32:10.7721499Z else: 2025-05-07T20:32:10.7721788Z scale_ub_tensor = None 2025-05-07T20:32:10.7722167Z 2025-05-07T20:32:10.7722505Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.7722890Z op = silu_mul_quant 2025-05-07T20:32:10.7723260Z if compiled: 2025-05-07T20:32:10.7723635Z op = torch.compile(op) 2025-05-07T20:32:10.7724034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7724427Z 2025-05-07T20:32:10.7724745Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.7724938Z 2025-05-07T20:32:10.7725099Z moe/activation_test.py:117: 2025-05-07T20:32:10.7725479Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7725935Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.7726303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7727075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.7727888Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.7728514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.7729361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.7730083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.7730744Z kernel = self.compile( 2025-05-07T20:32:10.7731445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.7732180Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.7732630Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7732968Z 2025-05-07T20:32:10.7733244Z self = 2025-05-07T20:32:10.7734424Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.7735872Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01e680>} 2025-05-07T20:32:10.7737433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.7738516Z context = 2025-05-07T20:32:10.7738863Z 2025-05-07T20:32:10.7739068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.7739874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.7740578Z module_map=module_map) 2025-05-07T20:32:10.7740988Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.7741500Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.7741876Z E ^ 2025-05-07T20:32:10.7742377Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.7742983Z 2025-05-07T20:32:10.7743431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.7744021Z 2025-05-07T20:32:10.9655784Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.9656638Z self=, 2025-05-07T20:32:10.9657294Z T=4096, 2025-05-07T20:32:10.9657697Z D=7168, 2025-05-07T20:32:10.9658009Z scale_ub=1200.0, 2025-05-07T20:32:10.9658305Z contiguous=False, 2025-05-07T20:32:10.9658675Z compiled=True, 2025-05-07T20:32:10.9658975Z ) 2025-05-07T20:32:10.9659384Z self = 2025-05-07T20:32:10.9660109Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:10.9660446Z 2025-05-07T20:32:10.9660550Z @given( 2025-05-07T20:32:10.9660882Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.9661352Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.9661755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.9662177Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.9662654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.9662993Z ) 2025-05-07T20:32:10.9663441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.9664024Z def test_silu_mul_quant( 2025-05-07T20:32:10.9664377Z self, 2025-05-07T20:32:10.9664613Z T: int, 2025-05-07T20:32:10.9664953Z D: int, 2025-05-07T20:32:10.9665286Z scale_ub: Optional[float], 2025-05-07T20:32:10.9665597Z contiguous: bool, 2025-05-07T20:32:10.9665985Z compiled: bool, 2025-05-07T20:32:10.9666364Z ) -> None: 2025-05-07T20:32:10.9666622Z torch.manual_seed(2025) 2025-05-07T20:32:10.9667013Z 2025-05-07T20:32:10.9667395Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.9667854Z 2025-05-07T20:32:10.9668116Z x_sign = torch.sign(x) 2025-05-07T20:32:10.9668518Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.9668962Z x = x_sign * x_clamp 2025-05-07T20:32:10.9669273Z x0 = x[:, :D] 2025-05-07T20:32:10.9669598Z x1 = x[:, D:] 2025-05-07T20:32:10.9669925Z 2025-05-07T20:32:10.9670201Z if contiguous: 2025-05-07T20:32:10.9670565Z x0 = x0.contiguous() 2025-05-07T20:32:10.9670947Z x1 = x1.contiguous() 2025-05-07T20:32:10.9671274Z 2025-05-07T20:32:10.9671562Z if scale_ub is not None: 2025-05-07T20:32:10.9671958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.9672420Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.9672780Z ) 2025-05-07T20:32:10.9673088Z else: 2025-05-07T20:32:10.9673694Z scale_ub_tensor = None 2025-05-07T20:32:10.9673998Z 2025-05-07T20:32:10.9674375Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.9674838Z op = silu_mul_quant 2025-05-07T20:32:10.9675144Z if compiled: 2025-05-07T20:32:10.9675532Z op = torch.compile(op) 2025-05-07T20:32:10.9675930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.9676254Z 2025-05-07T20:32:10.9676585Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.9676828Z 2025-05-07T20:32:10.9676956Z moe/activation_test.py:117: 2025-05-07T20:32:10.9677458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.9678011Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.9678393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.9679065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.9679752Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.9680511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.9681302Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.9681970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.9682720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.9683462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.9684132Z kernel = self.compile( 2025-05-07T20:32:10.9684790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.9685501Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.9686031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.9686309Z 2025-05-07T20:32:10.9686582Z self = 2025-05-07T20:32:10.9687741Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.9689298Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01fb50>} 2025-05-07T20:32:10.9691059Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.9692147Z context = 2025-05-07T20:32:10.9692520Z 2025-05-07T20:32:10.9692787Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.9693396Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.9693897Z module_map=module_map) 2025-05-07T20:32:10.9694437Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.9694871Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.9695170Z E ^ 2025-05-07T20:32:10.9695836Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.9696340Z 2025-05-07T20:32:10.9696797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.9697343Z 2025-05-07T20:32:10.9697610Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.9698096Z self=, 2025-05-07T20:32:10.9698657Z T=128, 2025-05-07T20:32:10.9698998Z D=7168, 2025-05-07T20:32:10.9699262Z scale_ub=1200.0, 2025-05-07T20:32:10.9699573Z contiguous=False, 2025-05-07T20:32:10.9700039Z compiled=True, 2025-05-07T20:32:10.9700342Z ) 2025-05-07T20:32:11.0728026Z self = 2025-05-07T20:32:11.0728762Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:11.0729152Z 2025-05-07T20:32:11.0729264Z @given( 2025-05-07T20:32:11.0729592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.0730170Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.0730620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.0730969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.0731310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.0731602Z ) 2025-05-07T20:32:11.0731964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.0732422Z def test_silu_mul_quant( 2025-05-07T20:32:11.0732670Z self, 2025-05-07T20:32:11.0732874Z T: int, 2025-05-07T20:32:11.0733081Z D: int, 2025-05-07T20:32:11.0733303Z scale_ub: Optional[float], 2025-05-07T20:32:11.0733587Z contiguous: bool, 2025-05-07T20:32:11.0733834Z compiled: bool, 2025-05-07T20:32:11.0734064Z ) -> None: 2025-05-07T20:32:11.0734290Z torch.manual_seed(2025) 2025-05-07T20:32:11.0734546Z 2025-05-07T20:32:11.0734837Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.0735184Z 2025-05-07T20:32:11.0735397Z x_sign = torch.sign(x) 2025-05-07T20:32:11.0735699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.0736009Z x = x_sign * x_clamp 2025-05-07T20:32:11.0736262Z x0 = x[:, :D] 2025-05-07T20:32:11.0736485Z x1 = x[:, D:] 2025-05-07T20:32:11.0736697Z 2025-05-07T20:32:11.0736892Z if contiguous: 2025-05-07T20:32:11.0737133Z x0 = x0.contiguous() 2025-05-07T20:32:11.0737392Z x1 = x1.contiguous() 2025-05-07T20:32:11.0737641Z 2025-05-07T20:32:11.0737844Z if scale_ub is not None: 2025-05-07T20:32:11.0738119Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.0738459Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.0738775Z ) 2025-05-07T20:32:11.0738970Z else: 2025-05-07T20:32:11.0739190Z scale_ub_tensor = None 2025-05-07T20:32:11.0739457Z 2025-05-07T20:32:11.0739689Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.0740087Z op = silu_mul_quant 2025-05-07T20:32:11.0740353Z if compiled: 2025-05-07T20:32:11.0740638Z op = torch.compile(op) 2025-05-07T20:32:11.0740946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.0741221Z 2025-05-07T20:32:11.0741429Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.0741597Z 2025-05-07T20:32:11.0741712Z moe/activation_test.py:117: 2025-05-07T20:32:11.0742010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.0742353Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.0742649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.0743209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.0743774Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.0744455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.0745151Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.0745685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.0746371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.0747118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.0747664Z kernel = self.compile( 2025-05-07T20:32:11.0748208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.0748879Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.0749281Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.0749556Z 2025-05-07T20:32:11.0749766Z self = 2025-05-07T20:32:11.0750919Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.0752321Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aab4c670>} 2025-05-07T20:32:11.0753648Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.0754660Z context = 2025-05-07T20:32:11.0754950Z 2025-05-07T20:32:11.0755117Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.0755645Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.0756109Z module_map=module_map) 2025-05-07T20:32:11.0756477Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.0756828Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.0757085Z E ^ 2025-05-07T20:32:11.0757549Z E ValueError("type fp8e4nv not supported in this architecture. 
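For orientation: judging from the test body above and the kernel name in the traceback, silu_mul_quant computes silu(x0) * x1 and quantizes the result to FP8, returning the quantized tensor plus a scale. A rough eager-mode sketch of that contract follows; the rowwise scaling and the scale_ub clamp are assumptions read off the call signature, not FBGEMM's documented behavior:

    import torch

    def silu_mul_quant_reference(x0, x1, scale_ub=None):
        # Assumed semantics: y = silu(x0) * x1, quantized rowwise to FP8 E4M3.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap per-row scales
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        scale = row_max / fp8_max
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)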
2025-05-07T20:32:11.0759040Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:11.0773371Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.0773645Z moe/activation_test.py:117:
2025-05-07T20:32:11.0788088Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.0788438Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.0788696Z E       ^
2025-05-07T20:32:11.0789157Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.0790475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
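Both CompilationErrors so far, and all of the ones below, share one root cause: Triton cannot lower the fp8e4nv (FP8 E4M3) dtype on this runner's GPU. The g5 runner's NVIDIA A10G reports compute capability (8, 6), and in this Triton build fp8e4nv generally requires compute capability 8.9 or newer (Ada/Hopper); on Ampere only fp8e4b15 and fp8e5 are exposed, exactly the pair the ValueError lists. A guard along these lines would skip rather than fail on such hardware; the helper name and the skipIf placement are illustrative, not part of the test file:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv kernels want compute capability >= 8.9 (e.g. L4, H100);
        # the A10G on this runner reports (8, 6).
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test method:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...): ...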
2025-05-07T20:32:11.1627967Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:11.1637378Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:11.1639386Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:11.1641383Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:11.1641707Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:11.1650818Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:11.1652805Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.1654825Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:11.1655146Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:11.1663190Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.1665344Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.1667383Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.1667700Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:11.1676433Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:11.1678404Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.1680387Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:11.1680711Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.2957889Z >       x_sign = torch.sign(x)
2025-05-07T20:32:11.2959813Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.2961785Z moe/activation_test.py:94: OutOfMemoryError
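The "Tried to allocate" sizes match the test's input tensor exactly: x is [T, 2 * D] in bfloat16 (2 bytes per element), and torch.sign and torch.clamp(torch.abs(...)) each materialize another tensor of the same size. A quick sanity check of the shapes seen in this run:

    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): bytes = T * 2D * 2
    for T, D in [(16384, 7168), (16384, 5120), (4096, 7168),
                 (4096, 5120), (2048, 7168), (2048, 5120)]:
        print(f"T={T:<6} D={D}: {T * 2 * D * 2 / 2**20:.2f} MiB")
    # -> 448.00, 320.00, 112.00, 80.00, 56.00, 40.00 MiB,
    #    matching every "Tried to allocate" size in this log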
2025-05-07T20:32:11.2962101Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:11.2985046Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.2985325Z moe/activation_test.py:117:
2025-05-07T20:32:11.2999479Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.2999837Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.3000110Z E       ^
2025-05-07T20:32:11.3000590Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.3001475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.3002105Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.3788129Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.3788394Z moe/activation_test.py:117:
2025-05-07T20:32:11.3802474Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.3802827Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.3803082Z E       ^
2025-05-07T20:32:11.3803547Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.3804427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.3805042Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.3819484Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.3819755Z moe/activation_test.py:117:
2025-05-07T20:32:11.3833407Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.3833768Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.3834033Z E       ^
2025-05-07T20:32:11.3834493Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.3835363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.3835988Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:11.4806043Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.4808092Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.4810106Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.4810430Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:11.4825160Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.4825427Z moe/activation_test.py:117:
2025-05-07T20:32:11.4838931Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.4839275Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.4839534Z E       ^
2025-05-07T20:32:11.4840001Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.4840864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
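By this point the process is pinned near the A10G's 22.07 GiB with only 26-28 MiB free, so even 40 MiB requests fail: allocations from earlier hypothesis examples are still being held across iterations, and the allocator message itself suggests fragmentation may contribute. Two standard PyTorch mitigations worth trying (neither is FBGEMM-specific, and neither is verified against this run) are the expandable-segments allocator mode the message recommends and an explicit cache flush between examples:

    import os

    # Must be set before the first CUDA allocation in the process;
    # in CI this is usually exported in the workflow environment instead.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_blocks() -> None:
        # Candidate tearDown() hook between hypothesis examples: returns
        # cached, currently unused blocks to the driver to curb fragmentation.
        torch.cuda.empty_cache()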
2025-05-07T20:32:11.4841491Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.4849926Z >       x_sign = torch.sign(x)
2025-05-07T20:32:11.4851924Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.4853981Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:11.4854397Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.5826426Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.5828507Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.5830509Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.5830833Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.5839447Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.5841533Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.5843518Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.5843844Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:11.5852159Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.5854208Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.5856267Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.5856594Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:11.5865135Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.5867180Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.5869161Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.5869496Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:11.5877855Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.5880053Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.5882026Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.5882360Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:11.7145857Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.7147950Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.7149925Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.7150239Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:11.7158682Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.7160711Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7162553Z 2025-05-07T20:32:11.7162678Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7162888Z 2025-05-07T20:32:11.7162999Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7163408Z self=, 2025-05-07T20:32:11.7163808Z T=4096, 2025-05-07T20:32:11.7164005Z D=7168, 2025-05-07T20:32:11.7164192Z scale_ub=None, 2025-05-07T20:32:11.7164408Z contiguous=True, 2025-05-07T20:32:11.7164640Z compiled=False, 2025-05-07T20:32:11.7164840Z ) 2025-05-07T20:32:11.7165168Z self = 2025-05-07T20:32:11.7165662Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.7165926Z 2025-05-07T20:32:11.7166015Z @given( 2025-05-07T20:32:11.7166240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7166551Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7166857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7167179Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7167506Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7167794Z ) 2025-05-07T20:32:11.7168136Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7168582Z def test_silu_mul_quant( 2025-05-07T20:32:11.7168825Z self, 2025-05-07T20:32:11.7169018Z T: int, 2025-05-07T20:32:11.7169214Z D: int, 2025-05-07T20:32:11.7169434Z scale_ub: Optional[float], 2025-05-07T20:32:11.7169699Z contiguous: bool, 2025-05-07T20:32:11.7169996Z compiled: bool, 2025-05-07T20:32:11.7170221Z ) -> None: 2025-05-07T20:32:11.7170436Z torch.manual_seed(2025) 2025-05-07T20:32:11.7170674Z 2025-05-07T20:32:11.7170946Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7173083Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7174998Z 2025-05-07T20:32:11.7175123Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7175338Z 2025-05-07T20:32:11.7175443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7175858Z self=, 2025-05-07T20:32:11.7176258Z T=16384, 2025-05-07T20:32:11.7176454Z D=7168, 2025-05-07T20:32:11.7176640Z scale_ub=None, 2025-05-07T20:32:11.7176859Z contiguous=True, 2025-05-07T20:32:11.7177085Z compiled=False, 2025-05-07T20:32:11.7177285Z ) 2025-05-07T20:32:11.7177600Z self = 2025-05-07T20:32:11.7178095Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.7178367Z 2025-05-07T20:32:11.7178446Z @given( 2025-05-07T20:32:11.7178684Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7178996Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7179296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7179625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7180051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7180333Z ) 2025-05-07T20:32:11.7180676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7181116Z def test_silu_mul_quant( 2025-05-07T20:32:11.7181360Z self, 2025-05-07T20:32:11.7181551Z T: int, 2025-05-07T20:32:11.7181750Z D: int, 2025-05-07T20:32:11.7181972Z scale_ub: Optional[float], 2025-05-07T20:32:11.7182239Z contiguous: bool, 2025-05-07T20:32:11.7182484Z compiled: bool, 2025-05-07T20:32:11.7182706Z ) -> None: 2025-05-07T20:32:11.7182917Z torch.manual_seed(2025) 2025-05-07T20:32:11.7183166Z 2025-05-07T20:32:11.7183436Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7185457Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7187290Z 2025-05-07T20:32:11.7187417Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7187625Z 2025-05-07T20:32:11.7187735Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7188144Z self=, 2025-05-07T20:32:11.7188549Z T=16384, 2025-05-07T20:32:11.7188734Z D=7168, 2025-05-07T20:32:11.7188923Z scale_ub=1200.0, 2025-05-07T20:32:11.7189145Z contiguous=True, 2025-05-07T20:32:11.7189359Z compiled=False, 2025-05-07T20:32:11.7189560Z ) 2025-05-07T20:32:11.7190212Z self = 2025-05-07T20:32:11.7190705Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.7190983Z 2025-05-07T20:32:11.7191059Z @given( 2025-05-07T20:32:11.7191286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7191595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7191893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7192221Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7192633Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7192911Z ) 2025-05-07T20:32:11.7193369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7193810Z def test_silu_mul_quant( 2025-05-07T20:32:11.7194048Z self, 2025-05-07T20:32:11.7194241Z T: int, 2025-05-07T20:32:11.7194438Z D: int, 2025-05-07T20:32:11.7194656Z scale_ub: Optional[float], 2025-05-07T20:32:11.7194926Z contiguous: bool, 2025-05-07T20:32:11.7195171Z compiled: bool, 2025-05-07T20:32:11.7195393Z ) -> None: 2025-05-07T20:32:11.7195608Z torch.manual_seed(2025) 2025-05-07T20:32:11.7195847Z 2025-05-07T20:32:11.7196122Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7198141Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
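The allocator hint repeated in each of these messages is an environment knob, not a code change. A minimal sketch of applying it, assuming it is set in the test process itself; the caching allocator reads PYTORCH_CUDA_ALLOC_CONF once, when the first CUDA allocation initializes it, so the variable must be set before that point (or simply exported in the shell that launches pytest):

import os

# Must be in the environment before the first CUDA allocation; equivalent to
# running: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True pytest ...
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.zeros(1, device="cuda")  # first CUDA allocation now uses expandable segments

Expandable segments reduce fragmentation-driven OOMs; they do not help here if the pool is genuinely full, as the 22.04 GiB in use above suggests.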
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7199985Z 2025-05-07T20:32:11.7200103Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7200316Z 2025-05-07T20:32:11.7200423Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7200831Z self=, 2025-05-07T20:32:11.7201226Z T=128, 2025-05-07T20:32:11.7201406Z D=5120, 2025-05-07T20:32:11.7201599Z scale_ub=1200.0, 2025-05-07T20:32:11.7201823Z contiguous=False, 2025-05-07T20:32:11.7202040Z compiled=False, 2025-05-07T20:32:11.7202244Z ) 2025-05-07T20:32:12.0571707Z self = 2025-05-07T20:32:12.0572479Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.0572881Z 2025-05-07T20:32:12.0573022Z @given( 2025-05-07T20:32:12.0573349Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.0573703Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.0574009Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.0574359Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.0574691Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.0574974Z ) 2025-05-07T20:32:12.0575326Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.0575764Z def test_silu_mul_quant( 2025-05-07T20:32:12.0576009Z self, 2025-05-07T20:32:12.0576207Z T: int, 2025-05-07T20:32:12.0576409Z D: int, 2025-05-07T20:32:12.0576628Z scale_ub: Optional[float], 2025-05-07T20:32:12.0576907Z contiguous: bool, 2025-05-07T20:32:12.0577148Z compiled: bool, 2025-05-07T20:32:12.0577375Z ) -> None: 2025-05-07T20:32:12.0577604Z torch.manual_seed(2025) 2025-05-07T20:32:12.0577857Z 2025-05-07T20:32:12.0578132Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.0578471Z 2025-05-07T20:32:12.0578666Z x_sign = torch.sign(x) 2025-05-07T20:32:12.0579188Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.0579492Z x = x_sign * x_clamp 2025-05-07T20:32:12.0579736Z x0 = x[:, :D] 2025-05-07T20:32:12.0580055Z x1 = x[:, D:] 2025-05-07T20:32:12.0580257Z 2025-05-07T20:32:12.0580447Z if contiguous: 2025-05-07T20:32:12.0580683Z x0 = x0.contiguous() 2025-05-07T20:32:12.0580939Z x1 = x1.contiguous() 2025-05-07T20:32:12.0581178Z 2025-05-07T20:32:12.0581377Z if scale_ub is not None: 2025-05-07T20:32:12.0581742Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.0582082Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.0582518Z ) 2025-05-07T20:32:12.0582711Z else: 2025-05-07T20:32:12.0582927Z scale_ub_tensor = None 2025-05-07T20:32:12.0583179Z 2025-05-07T20:32:12.0583407Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.0583728Z op = silu_mul_quant 2025-05-07T20:32:12.0583982Z if compiled: 2025-05-07T20:32:12.0584233Z op = torch.compile(op) 2025-05-07T20:32:12.0584526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.0584796Z 2025-05-07T20:32:12.0584992Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.0585155Z 2025-05-07T20:32:12.0585259Z moe/activation_test.py:117: 2025-05-07T20:32:12.0585559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.0585894Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.0586176Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.0586876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.0587575Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.0588112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.0588787Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.0589449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.0590238Z kernel = self.compile( 2025-05-07T20:32:12.0590779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.0591438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.0591846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.0592076Z 2025-05-07T20:32:12.0592299Z self = 2025-05-07T20:32:12.0593368Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.0594807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aa858940>} 2025-05-07T20:32:12.0596140Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.0597174Z context = 2025-05-07T20:32:12.0597462Z 2025-05-07T20:32:12.0597655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.0598180Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.0598644Z module_map=module_map) 2025-05-07T20:32:12.0599013Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.0599443Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.0599701Z E ^ 2025-05-07T20:32:12.0600286Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.0600760Z 2025-05-07T20:32:12.0608260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.0608828Z 2025-05-07T20:32:12.0608941Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.0609369Z self=, 2025-05-07T20:32:12.0609902Z T=2048, 2025-05-07T20:32:12.0610096Z D=7168, 2025-05-07T20:32:12.0610408Z scale_ub=None, 2025-05-07T20:32:12.0610641Z contiguous=False, 2025-05-07T20:32:12.0610876Z compiled=False, 2025-05-07T20:32:12.0611094Z ) 2025-05-07T20:32:12.0611422Z self = 2025-05-07T20:32:12.0611925Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.0612207Z 2025-05-07T20:32:12.0612287Z @given( 2025-05-07T20:32:12.0612531Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.0612845Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.0613163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.0613507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.0613845Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.0614160Z ) 2025-05-07T20:32:12.0614544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.0615002Z def test_silu_mul_quant( 2025-05-07T20:32:12.0615250Z self, 2025-05-07T20:32:12.0615461Z T: int, 2025-05-07T20:32:12.0615668Z D: int, 2025-05-07T20:32:12.0615890Z scale_ub: Optional[float], 2025-05-07T20:32:12.0616172Z contiguous: bool, 2025-05-07T20:32:12.0616423Z compiled: bool, 2025-05-07T20:32:12.0616653Z ) -> None: 2025-05-07T20:32:12.0616879Z torch.manual_seed(2025) 2025-05-07T20:32:12.0617131Z 2025-05-07T20:32:12.0617411Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.0619479Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
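The CompilationError above is an architecture mismatch rather than a kernel bug: Triton lowers fp8e4nv (FP8 E4M3) only on SM 8.9+ GPUs (Ada/Hopper), while the A10G in a linux.g5.4xlarge runner reports SM 8.6, where only 'fp8e4b15' and 'fp8e5' are available, exactly as the ValueError says. A hedged sketch of a capability guard; the helper name and skip message are assumptions, not FBGEMM's actual gating:

import unittest

import torch

def fp8e4nv_supported() -> bool:
    # Triton's fp8e4nv (E4M3) codegen needs compute capability 8.9 or newer;
    # an A10G reports (8, 6), which is why this job hits the ValueError.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical gating for illustration only.
@unittest.skipIf(not fp8e4nv_supported(), "FP8 E4M3 not supported on this GPU")
class ActivationTests(unittest.TestCase):
    ...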
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.0621423Z 2025-05-07T20:32:12.0621548Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.0621769Z 2025-05-07T20:32:12.0621879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.0622300Z self=, 2025-05-07T20:32:12.0622707Z T=128, 2025-05-07T20:32:12.0622905Z D=7168, 2025-05-07T20:32:12.0623108Z scale_ub=1200.0, 2025-05-07T20:32:12.0623333Z contiguous=True, 2025-05-07T20:32:12.0623569Z compiled=True, 2025-05-07T20:32:12.0623781Z ) 2025-05-07T20:32:12.1039304Z self = 2025-05-07T20:32:12.1040424Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.1041008Z 2025-05-07T20:32:12.1041168Z @given( 2025-05-07T20:32:12.1041658Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1042134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1042608Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1043123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1043829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1044175Z ) 2025-05-07T20:32:12.1044564Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1045013Z def test_silu_mul_quant( 2025-05-07T20:32:12.1045261Z self, 2025-05-07T20:32:12.1045470Z T: int, 2025-05-07T20:32:12.1045680Z D: int, 2025-05-07T20:32:12.1045905Z scale_ub: Optional[float], 2025-05-07T20:32:12.1046214Z contiguous: bool, 2025-05-07T20:32:12.1046468Z compiled: bool, 2025-05-07T20:32:12.1046785Z ) -> None: 2025-05-07T20:32:12.1047007Z torch.manual_seed(2025) 2025-05-07T20:32:12.1047265Z 2025-05-07T20:32:12.1047660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1048010Z 2025-05-07T20:32:12.1048209Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1048513Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.1048818Z x = x_sign * x_clamp 2025-05-07T20:32:12.1049064Z x0 = x[:, :D] 2025-05-07T20:32:12.1049285Z x1 = x[:, D:] 2025-05-07T20:32:12.1049488Z 2025-05-07T20:32:12.1049680Z if contiguous: 2025-05-07T20:32:12.1049918Z x0 = x0.contiguous() 2025-05-07T20:32:12.1050174Z x1 = x1.contiguous() 2025-05-07T20:32:12.1050419Z 2025-05-07T20:32:12.1050619Z if scale_ub is not None: 2025-05-07T20:32:12.1050890Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.1051226Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.1051539Z ) 2025-05-07T20:32:12.1051733Z else: 2025-05-07T20:32:12.1051947Z scale_ub_tensor = None 2025-05-07T20:32:12.1052210Z 2025-05-07T20:32:12.1052441Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.1052755Z op = silu_mul_quant 2025-05-07T20:32:12.1053014Z if compiled: 2025-05-07T20:32:12.1053265Z op = torch.compile(op) 2025-05-07T20:32:12.1053564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1053840Z 2025-05-07T20:32:12.1054043Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.1054206Z 2025-05-07T20:32:12.1054307Z moe/activation_test.py:117: 2025-05-07T20:32:12.1054610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1054943Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.1055221Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1055780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.1056342Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.1057009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.1057690Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.1058224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.1058909Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.1059563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.1060235Z kernel = self.compile( 2025-05-07T20:32:12.1060778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.1061432Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.1061825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1062064Z 2025-05-07T20:32:12.1062274Z self = 2025-05-07T20:32:12.1063363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.1064807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aa858dc0>} 2025-05-07T20:32:12.1066132Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.1067210Z context = 2025-05-07T20:32:12.1067503Z 2025-05-07T20:32:12.1067743Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.1068270Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.1068732Z module_map=module_map) 2025-05-07T20:32:12.1069107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.1069462Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.1069723Z E ^ 2025-05-07T20:32:12.1070182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.1070630Z 2025-05-07T20:32:12.1071044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.1071550Z 2025-05-07T20:32:12.1071660Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1072071Z self=, 2025-05-07T20:32:12.1072475Z T=128, 2025-05-07T20:32:12.1072671Z D=7168, 2025-05-07T20:32:12.1072870Z scale_ub=1200.0, 2025-05-07T20:32:12.1073091Z contiguous=True, 2025-05-07T20:32:12.1073319Z compiled=False, 2025-05-07T20:32:12.1073526Z ) 2025-05-07T20:32:12.1073842Z self = 2025-05-07T20:32:12.1074342Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.1074611Z 2025-05-07T20:32:12.1074699Z @given( 2025-05-07T20:32:12.1074931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1075247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1075562Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1075885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1076220Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1076517Z ) 2025-05-07T20:32:12.1076877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1077314Z def test_silu_mul_quant( 2025-05-07T20:32:12.1077558Z self, 2025-05-07T20:32:12.1077756Z T: int, 2025-05-07T20:32:12.1077950Z D: int, 2025-05-07T20:32:12.1078176Z scale_ub: Optional[float], 2025-05-07T20:32:12.1078454Z contiguous: bool, 2025-05-07T20:32:12.1078690Z compiled: bool, 2025-05-07T20:32:12.1078924Z ) -> None: 2025-05-07T20:32:12.1079143Z torch.manual_seed(2025) 2025-05-07T20:32:12.1079383Z 2025-05-07T20:32:12.1079658Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1080002Z 2025-05-07T20:32:12.1080191Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1080487Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.1082487Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.1084429Z 2025-05-07T20:32:12.1084553Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:12.1084765Z 2025-05-07T20:32:12.1084878Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1085288Z self=, 2025-05-07T20:32:12.1085690Z T=128, 2025-05-07T20:32:12.1085884Z D=5120, 2025-05-07T20:32:12.1086073Z scale_ub=1200.0, 2025-05-07T20:32:12.1086296Z contiguous=True, 2025-05-07T20:32:12.1086565Z compiled=True, 2025-05-07T20:32:12.1086768Z ) 2025-05-07T20:32:12.1087196Z self = 2025-05-07T20:32:12.1087691Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.1087956Z 2025-05-07T20:32:12.1088039Z @given( 2025-05-07T20:32:12.1088269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1088588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1088895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1089222Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1089556Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1090173Z ) 2025-05-07T20:32:12.1090530Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1090972Z def test_silu_mul_quant( 2025-05-07T20:32:12.1091222Z self, 2025-05-07T20:32:12.1091428Z T: int, 2025-05-07T20:32:12.1091623Z D: int, 2025-05-07T20:32:12.1091848Z scale_ub: Optional[float], 2025-05-07T20:32:12.1092128Z contiguous: bool, 2025-05-07T20:32:12.1092367Z compiled: bool, 2025-05-07T20:32:12.1092597Z ) -> None: 2025-05-07T20:32:12.1092818Z torch.manual_seed(2025) 2025-05-07T20:32:12.1093056Z 2025-05-07T20:32:12.1093335Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1093683Z 2025-05-07T20:32:12.1093877Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1094173Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.1096208Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.1098032Z 2025-05-07T20:32:12.1098160Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:12.1098396Z 2025-05-07T20:32:12.1098510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1098927Z self=, 2025-05-07T20:32:12.1099328Z T=128, 2025-05-07T20:32:12.1099521Z D=7168, 2025-05-07T20:32:12.1099718Z scale_ub=None, 2025-05-07T20:32:12.1100005Z contiguous=True, 2025-05-07T20:32:12.1100232Z compiled=True, 2025-05-07T20:32:12.1100436Z ) 2025-05-07T20:32:12.3543886Z self = 2025-05-07T20:32:12.3544579Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.3544871Z 2025-05-07T20:32:12.3544951Z @given( 2025-05-07T20:32:12.3545190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3545511Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3545822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3546157Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3546481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3547039Z ) 2025-05-07T20:32:12.3547391Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3547824Z def test_silu_mul_quant( 2025-05-07T20:32:12.3548071Z self, 2025-05-07T20:32:12.3548267Z T: int, 2025-05-07T20:32:12.3548462Z D: int, 2025-05-07T20:32:12.3548687Z scale_ub: Optional[float], 2025-05-07T20:32:12.3548963Z contiguous: bool, 2025-05-07T20:32:12.3549208Z compiled: bool, 2025-05-07T20:32:12.3549430Z ) -> None: 2025-05-07T20:32:12.3549739Z torch.manual_seed(2025) 2025-05-07T20:32:12.3549982Z 2025-05-07T20:32:12.3550381Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3552410Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
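Also worth noting in the examples above: the reported free memory has shrunk from 26.44 MiB to 4.44 MiB, and the OOM has moved from the randn at activation_test.py:92 to the 20 MiB clamp temporary at line 95, which is consistent with allocations from earlier Hypothesis examples still occupying the pool. Because Hypothesis generates all of its examples inside a single unittest test invocation, per-example cleanup cannot live in setUp/tearDown; a minimal sketch of a helper the test body could call first (the name and placement are assumptions):

import gc

import torch

def reset_cuda_pool() -> None:
    # Hypothesis re-enters the test body once per generated example, while
    # unittest's setUp/tearDown run only once around the whole @given test,
    # so per-example cleanup has to be invoked from inside the body itself.
    gc.collect()              # release Python references from the last example
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver

Calling reset_cuda_pool() as the first statement of test_silu_mul_quant would give each generated example a clean allocator pool.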
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.3554254Z 2025-05-07T20:32:12.3554375Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.3554589Z 2025-05-07T20:32:12.3586340Z FAILED 2025-05-07T20:32:12.3586469Z 2025-05-07T20:32:12.3586646Z =================================== FAILURES =================================== 2025-05-07T20:32:12.3587274Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:12.3587900Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:12.3588751Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:12.3589516Z | yield 2025-05-07T20:32:12.3590452Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:12.3591191Z | self._callTestMethod(testMethod) 2025-05-07T20:32:12.3591972Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:12.3592715Z | method() 2025-05-07T20:32:12.3593590Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:12.3594612Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3595506Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:12.3596353Z | raise the_error_hypothesis_found 2025-05-07T20:32:12.3597037Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:12.3597706Z +-+---------------- 1 ---------------- 2025-05-07T20:32:12.3598115Z | Traceback (most recent call last): 2025-05-07T20:32:12.3599083Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:12.3600165Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3603021Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.3605935Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:12.3606543Z | self=, 2025-05-07T20:32:12.3607104Z | T=2048, 2025-05-07T20:32:12.3607418Z | D=5120, # or any other generated value 2025-05-07T20:32:12.3607880Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:12.3608372Z | contiguous=True, # or any other generated value 2025-05-07T20:32:12.3608874Z | compiled=False, # or any other generated value 2025-05-07T20:32:12.3609381Z | ) 2025-05-07T20:32:12.3609621Z | 2025-05-07T20:32:12.3610475Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:12.3611327Z +---------------- 2 ---------------- 2025-05-07T20:32:12.3611742Z | Traceback (most recent call last): 2025-05-07T20:32:12.3612755Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:12.3613844Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3616700Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.3618756Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:12.3620169Z | self=, 2025-05-07T20:32:12.3620747Z | T=128, 2025-05-07T20:32:12.3621016Z | D=7168, 2025-05-07T20:32:12.3621300Z | scale_ub=None, 2025-05-07T20:32:12.3621631Z | contiguous=True, 2025-05-07T20:32:12.3621957Z | compiled=True, 2025-05-07T20:32:12.3622262Z | ) 2025-05-07T20:32:12.3622506Z | 2025-05-07T20:32:12.3623225Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:12.3623876Z +---------------- 3 ---------------- 2025-05-07T20:32:12.3624177Z | Traceback (most recent call last): 2025-05-07T20:32:12.3624890Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:12.3625660Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3627697Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
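Each "You can reproduce this example..." note in these sub-exceptions is directly actionable. A sketch of the suggested decorator placement for failure 1, with the version string and payload copied verbatim from the log; the _MAX_SAMPLES value is an assumed stand-in for the real constant in activation_test.py, and self is dropped to keep the sketch standalone:

from typing import Optional

from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

_MAX_SAMPLES = 10  # assumption; the real value lives in activation_test.py

# Temporarily stacked above the test's existing decorators.
@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
def test_silu_mul_quant(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    ...  # test body unchanged

With the decorator in place, Hypothesis replays exactly that falsifying example instead of searching; it is meant to be removed once the failure is fixed.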
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.3629692Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:12.3630140Z | self=, 2025-05-07T20:32:12.3630542Z | T=128, 2025-05-07T20:32:12.3630747Z | D=5120, 2025-05-07T20:32:12.3630962Z | scale_ub=1200.0, 2025-05-07T20:32:12.3631210Z | contiguous=True, 2025-05-07T20:32:12.3631446Z | compiled=True, 2025-05-07T20:32:12.3631676Z | ) 2025-05-07T20:32:12.3631934Z | 2025-05-07T20:32:12.3632452Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:12.3633059Z +---------------- 4 ---------------- 2025-05-07T20:32:12.3633352Z | Traceback (most recent call last): 2025-05-07T20:32:12.3634053Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:12.3634759Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:12.3635531Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:12.3636229Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.3637065Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:12.3637864Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.3638474Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:12.3639202Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3639934Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:12.3640702Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.3641506Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:12.3642302Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.3643071Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:12.3643760Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.3644409Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:12.3644969Z | fn() 2025-05-07T20:32:12.3645530Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:12.3646300Z | self.fn.run( 2025-05-07T20:32:12.3647037Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:12.3647830Z | kernel = self.compile( 2025-05-07T20:32:12.3648663Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:12.3649650Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3650639Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:12.3651723Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3652445Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3652934Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.3653305Z | ^ 2025-05-07T20:32:12.3653934Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3654725Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:12.3655284Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:12.3656001Z | self=, 2025-05-07T20:32:12.3656678Z | T=1, # or any other generated value 2025-05-07T20:32:12.3657120Z | D=5120, # or any other generated value 2025-05-07T20:32:12.3657595Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:12.3658099Z | contiguous=True, # or any other generated value 2025-05-07T20:32:12.3658624Z | compiled=True, # or any other generated value 2025-05-07T20:32:12.3659038Z | ) 2025-05-07T20:32:12.3659298Z | 2025-05-07T20:32:12.3660190Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:12.3681615Z +------------------------------------ 2025-05-07T20:32:12.3682137Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:12.3682654Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3683237Z self=, 2025-05-07T20:32:12.3683807Z T=1, 2025-05-07T20:32:12.3684068Z D=5120, 2025-05-07T20:32:12.3684345Z scale_ub=None, 2025-05-07T20:32:12.3684657Z contiguous=True, 2025-05-07T20:32:12.3684975Z compiled=True, 2025-05-07T20:32:12.3685277Z ) 2025-05-07T20:32:12.3685735Z self = 2025-05-07T20:32:12.3686400Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.3686766Z 2025-05-07T20:32:12.3686876Z @given( 2025-05-07T20:32:12.3687199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3687632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3688060Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3688503Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3688953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3689332Z ) 2025-05-07T20:32:12.3689805Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3690715Z def test_silu_mul_quant( 2025-05-07T20:32:12.3691040Z self, 2025-05-07T20:32:12.3691307Z T: int, 2025-05-07T20:32:12.3691580Z D: int, 2025-05-07T20:32:12.3691867Z scale_ub: Optional[float], 2025-05-07T20:32:12.3692238Z contiguous: bool, 2025-05-07T20:32:12.3692562Z compiled: bool, 2025-05-07T20:32:12.3692859Z ) -> None: 2025-05-07T20:32:12.3693148Z torch.manual_seed(2025) 2025-05-07T20:32:12.3693480Z 2025-05-07T20:32:12.3693872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3694390Z 2025-05-07T20:32:12.3694658Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3695041Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3695444Z x = x_sign * x_clamp 2025-05-07T20:32:12.3695784Z x0 = x[:, :D] 2025-05-07T20:32:12.3696094Z x1 = x[:, D:] 2025-05-07T20:32:12.3696385Z 2025-05-07T20:32:12.3696646Z if contiguous: 2025-05-07T20:32:12.3696967Z x0 = x0.contiguous() 
2025-05-07T20:32:12.3697308Z x1 = x1.contiguous() 2025-05-07T20:32:12.3697632Z 2025-05-07T20:32:12.3697894Z if scale_ub is not None: 2025-05-07T20:32:12.3698263Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3698733Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3699165Z ) 2025-05-07T20:32:12.3699426Z else: 2025-05-07T20:32:12.3699709Z scale_ub_tensor = None 2025-05-07T20:32:12.3700190Z 2025-05-07T20:32:12.3700506Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3700950Z op = silu_mul_quant 2025-05-07T20:32:12.3701309Z if compiled: 2025-05-07T20:32:12.3701670Z op = torch.compile(op) 2025-05-07T20:32:12.3702080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3702481Z 2025-05-07T20:32:12.3702947Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.3703336Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.3703738Z 2025-05-07T20:32:12.3704063Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3704495Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.3704886Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.3705302Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.3705774Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.3706282Z 2025-05-07T20:32:12.3706557Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:12.3706815Z 2025-05-07T20:32:12.3707093Z moe/activation_test.py:126: 2025-05-07T20:32:12.3707483Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3707934Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.3708395Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.3709453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.3710473Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.3711189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3712098Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3713058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.3714084Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.3715133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.3716155Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.3717152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.3718034Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.3718847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.3719568Z fn() 2025-05-07T20:32:12.3720270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.3721071Z self.fn.run( 2025-05-07T20:32:12.3721714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3722432Z kernel = self.compile( 2025-05-07T20:32:12.3723130Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3724018Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3724567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3724888Z 2025-05-07T20:32:12.3725178Z self = 2025-05-07T20:32:12.3726637Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3728502Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07cfc6f400>} 2025-05-07T20:32:12.3730394Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3731880Z context = 2025-05-07T20:32:12.3732290Z 2025-05-07T20:32:12.3732537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3733274Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3733951Z module_map=module_map) 2025-05-07T20:32:12.3734453Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3734920Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.3735345Z E ^ 2025-05-07T20:32:12.3736061Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3736675Z 2025-05-07T20:32:12.3737240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3737932Z 2025-05-07T20:32:12.3738069Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3738657Z self=, 2025-05-07T20:32:12.3739226Z T=2048, 2025-05-07T20:32:12.3739485Z D=5120, 2025-05-07T20:32:12.3739759Z scale_ub=1200.0, 2025-05-07T20:32:12.3740198Z contiguous=True, 2025-05-07T20:32:12.3740500Z compiled=False, 2025-05-07T20:32:12.3740769Z ) 2025-05-07T20:32:12.3741185Z self = 2025-05-07T20:32:12.3741820Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.3742183Z 2025-05-07T20:32:12.3742285Z @given( 2025-05-07T20:32:12.3742589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3743014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3743409Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3743847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3744286Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3744668Z ) 2025-05-07T20:32:12.3745143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3745727Z def test_silu_mul_quant( 2025-05-07T20:32:12.3746057Z self, 2025-05-07T20:32:12.3746306Z T: int, 2025-05-07T20:32:12.3746568Z D: int, 2025-05-07T20:32:12.3746876Z scale_ub: Optional[float], 2025-05-07T20:32:12.3747257Z contiguous: bool, 2025-05-07T20:32:12.3747601Z compiled: bool, 2025-05-07T20:32:12.3747923Z ) -> None: 2025-05-07T20:32:12.3748224Z torch.manual_seed(2025) 2025-05-07T20:32:12.3748571Z 2025-05-07T20:32:12.3748958Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3749430Z 2025-05-07T20:32:12.3749702Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3750114Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3750544Z x = x_sign * x_clamp 2025-05-07T20:32:12.3750891Z x0 = x[:, :D] 
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
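Note: the repeated ValueError comes from Triton's NVIDIA backend. The fp8e4nv type (PyTorch's torch.float8_e4m3fn) is, to our understanding, only lowered on GPUs with compute capability 8.9 or newer (Ada/Hopper), while this job's linux.g5.4xlarge runner carries an A10G (sm_86), where Triton offers only fp8e4b15 and fp8e5. A minimal sketch of a capability guard such a test could use; the helper name and skip wiring here are illustrative, not FBGEMM's actual API:

import pytest
import torch

def supports_fp8_e4m3() -> bool:
    # fp8e4nv corresponds to torch.float8_e4m3fn; Triton's NVIDIA backend
    # accepts it only on compute capability >= (8, 9) (Ada/Hopper).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical marker (not present in activation_test.py) that would skip
# these cases on pre-sm_89 GPUs such as the A10G in this job:
requires_fp8_e4m3 = pytest.mark.skipif(
    not supports_fp8_e4m3(), reason="fp8e4nv requires sm_89 or newer"
)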
Hypothesis then retries further examples, and every one fails with the same
ValueError("type fp8e4nv not supported in this architecture. The supported fp8
dtypes are ('fp8e4b15', 'fp8e5')"); only the kernel that trips it first differs.
With compiled=False the eager path fails while compiling _fbgemm_silu_mul_quant
(fbgemm_gpu/experimental/gen_ai/moe/activation.py:80); with compiled=True the
failure surfaces in the reference path instead, while compiling
_kernel_quantize_fp8_row (triton_gemm/fp8_gemm.py:2370 via
triton_quantize_fp8_row). The source listing and tracebacks repeat verbatim for
each example and are identical to the two shown above. Condensed from the log:

Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=True  -> _kernel_quantize_fp8_row
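For reading the test: triton_quantize_fp8_row, which both reference-side failures funnel into, performs row-wise dynamic fp8 quantization. A rough pure-PyTorch sketch of the math, assuming the conventional formulation (per-row absolute max mapped to the fp8 range, optionally clamped from above by scale_ub); this is an approximation for orientation, not the FBGEMM kernel:

from typing import Optional, Tuple
import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute max, optionally clamped from above by scale_ub.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    # Scale each row so its max maps to the fp8 e4m3 maximum (448.0).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = torch.clamp(row_max, min=1e-12) / fp8_max
    y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
    # Dequantize as y_fp8.float() * scale[:, None], matching the test's
    # y = y_fp8.to(torch.float32) * y_scale[:, None].
    return y_fp8, scale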
The remaining drawn examples fail the same way:

Trying example: T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row

The log breaks off mid-traceback of the last example, inside the autotuner's
compile call for _kernel_quantize_fp8_row, before its CompilationError is
printed.
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4063056Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bd48bc70>} 2025-05-07T20:32:12.4063801Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4063999Z context = 2025-05-07T20:32:12.4064003Z 2025-05-07T20:32:12.4064176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4064439Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4064554Z module_map=module_map) 2025-05-07T20:32:12.4064725Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4064829Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.4064906Z E ^ 2025-05-07T20:32:12.4065265Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4065270Z 2025-05-07T20:32:12.4065681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4065687Z 2025-05-07T20:32:12.4065799Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4066029Z self=, 2025-05-07T20:32:12.4066109Z T=4096, 2025-05-07T20:32:12.4066189Z D=5120, 2025-05-07T20:32:12.4066272Z scale_ub=None, 2025-05-07T20:32:12.4066358Z contiguous=True, 2025-05-07T20:32:12.4066542Z compiled=True, 2025-05-07T20:32:12.4066615Z ) 2025-05-07T20:32:12.4066836Z self = 2025-05-07T20:32:12.4067008Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.4067013Z 2025-05-07T20:32:12.4067090Z @given( 2025-05-07T20:32:12.4067216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4067318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4067437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4067608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4067726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4067902Z ) 2025-05-07T20:32:12.4068156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4068252Z def test_silu_mul_quant( 2025-05-07T20:32:12.4068338Z self, 2025-05-07T20:32:12.4068418Z T: int, 2025-05-07T20:32:12.4068496Z D: int, 2025-05-07T20:32:12.4068604Z scale_ub: Optional[float], 2025-05-07T20:32:12.4068694Z contiguous: bool, 2025-05-07T20:32:12.4068781Z compiled: bool, 2025-05-07T20:32:12.4068864Z ) -> None: 2025-05-07T20:32:12.4068959Z torch.manual_seed(2025) 2025-05-07T20:32:12.4069031Z 2025-05-07T20:32:12.4069205Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4069281Z 2025-05-07T20:32:12.4069376Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4069515Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4069607Z x = x_sign * x_clamp 2025-05-07T20:32:12.4069700Z x0 = x[:, :D] 2025-05-07T20:32:12.4069783Z x1 = x[:, D:] 2025-05-07T20:32:12.4069859Z 2025-05-07T20:32:12.4069949Z if contiguous: 2025-05-07T20:32:12.4070043Z x0 = x0.contiguous() 2025-05-07T20:32:12.4070135Z x1 = x1.contiguous() 2025-05-07T20:32:12.4070221Z 2025-05-07T20:32:12.4070312Z if scale_ub is not None: 2025-05-07T20:32:12.4070421Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4070561Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4070639Z ) 2025-05-07T20:32:12.4070716Z else: 2025-05-07T20:32:12.4070820Z scale_ub_tensor 
= None 2025-05-07T20:32:12.4070894Z 2025-05-07T20:32:12.4071026Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4071125Z op = silu_mul_quant 2025-05-07T20:32:12.4071214Z if compiled: 2025-05-07T20:32:12.4071322Z op = torch.compile(op) 2025-05-07T20:32:12.4071439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4071512Z 2025-05-07T20:32:12.4071615Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.4071738Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.4071812Z 2025-05-07T20:32:12.4071958Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4072062Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.4072165Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.4072297Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.4072441Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4072522Z 2025-05-07T20:32:12.4072626Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:12.4072631Z 2025-05-07T20:32:12.4072732Z moe/activation_test.py:126: 2025-05-07T20:32:12.4072872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4072985Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.4073122Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4073683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.4073839Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.4074212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4074435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4074800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.4075062Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4075508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.4075839Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4076221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.4076393Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.4076738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.4076817Z fn() 2025-05-07T20:32:12.4077217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.4077307Z self.fn.run( 2025-05-07T20:32:12.4077642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4077761Z kernel = self.compile( 2025-05-07T20:32:12.4078154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4078337Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4078466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4078470Z 2025-05-07T20:32:12.4078683Z self = 2025-05-07T20:32:12.4079477Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4079975Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bcfa4940>} 2025-05-07T20:32:12.4080744Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4080940Z context = 2025-05-07T20:32:12.4080945Z 2025-05-07T20:32:12.4081110Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4081384Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4081494Z module_map=module_map) 2025-05-07T20:32:12.4081668Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4081772Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.4081850Z E ^ 2025-05-07T20:32:12.4082208Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4082216Z 2025-05-07T20:32:12.4082632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4082637Z 2025-05-07T20:32:12.4082749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4082969Z self=, 2025-05-07T20:32:12.4083050Z T=16384, 2025-05-07T20:32:12.4083181Z D=5120, 2025-05-07T20:32:12.4083266Z scale_ub=None, 2025-05-07T20:32:12.4083353Z contiguous=True, 2025-05-07T20:32:12.4083446Z compiled=True, 2025-05-07T20:32:12.4083519Z ) 2025-05-07T20:32:12.4083735Z self = 2025-05-07T20:32:12.4083917Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.4083921Z 2025-05-07T20:32:12.4084000Z @given( 2025-05-07T20:32:12.4084127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4084279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4084398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4084627Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4084745Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4084820Z ) 2025-05-07T20:32:12.4085072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4085171Z def test_silu_mul_quant( 2025-05-07T20:32:12.4085250Z self, 2025-05-07T20:32:12.4085337Z T: int, 2025-05-07T20:32:12.4085414Z D: int, 2025-05-07T20:32:12.4085514Z scale_ub: Optional[float], 2025-05-07T20:32:12.4085610Z contiguous: bool, 2025-05-07T20:32:12.4085697Z compiled: bool, 2025-05-07T20:32:12.4085783Z ) -> None: 2025-05-07T20:32:12.4085878Z torch.manual_seed(2025) 2025-05-07T20:32:12.4085951Z 2025-05-07T20:32:12.4086124Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4086207Z 2025-05-07T20:32:12.4086305Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4086442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4086532Z x = x_sign * x_clamp 2025-05-07T20:32:12.4086617Z x0 = x[:, :D] 2025-05-07T20:32:12.4086706Z x1 = x[:, D:] 2025-05-07T20:32:12.4086778Z 2025-05-07T20:32:12.4086862Z if contiguous: 2025-05-07T20:32:12.4086965Z x0 = x0.contiguous() 2025-05-07T20:32:12.4087055Z x1 = x1.contiguous() 2025-05-07T20:32:12.4087126Z 2025-05-07T20:32:12.4087226Z if scale_ub is not None: 2025-05-07T20:32:12.4087336Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4087478Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:12.4087557Z ) 2025-05-07T20:32:12.4087634Z else: 2025-05-07T20:32:12.4087737Z scale_ub_tensor = None 2025-05-07T20:32:12.4087811Z 2025-05-07T20:32:12.4087947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4088045Z op = silu_mul_quant 2025-05-07T20:32:12.4088139Z if compiled: 2025-05-07T20:32:12.4088243Z op = torch.compile(op) 2025-05-07T20:32:12.4088358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4088432Z 2025-05-07T20:32:12.4088524Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.4088656Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.4088729Z 2025-05-07T20:32:12.4088875Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4088976Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.4089078Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.4089209Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.4089350Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4089425Z 2025-05-07T20:32:12.4089531Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:12.4089539Z 2025-05-07T20:32:12.4089640Z moe/activation_test.py:126: 2025-05-07T20:32:12.4089783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4090186Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.4090386Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4090984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.4091247Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.4091604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4091836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4092200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.4092541Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4093052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.4093306Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4093696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.4093868Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.4094217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.4094298Z fn() 2025-05-07T20:32:12.4094698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.4094787Z self.fn.run( 2025-05-07T20:32:12.4095134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4095235Z kernel = self.compile( 2025-05-07T20:32:12.4095620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4095797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4095941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:12.4095946Z 2025-05-07T20:32:12.4096153Z self = 2025-05-07T20:32:12.4096925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4097430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bd48a9e0>} 2025-05-07T20:32:12.4098180Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4098380Z context = 2025-05-07T20:32:12.4098386Z 2025-05-07T20:32:12.4098551Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4098814Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4098930Z module_map=module_map) 2025-05-07T20:32:12.4099094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4099207Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.4099286Z E ^ 2025-05-07T20:32:12.4099639Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4099646Z 2025-05-07T20:32:12.4100179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4100184Z 2025-05-07T20:32:12.4100288Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4100517Z self=, 2025-05-07T20:32:12.4100648Z T=1, 2025-05-07T20:32:12.4100725Z D=5120, 2025-05-07T20:32:12.4100817Z scale_ub=1200.0, 2025-05-07T20:32:12.4100905Z contiguous=True, 2025-05-07T20:32:12.4100989Z compiled=True, 2025-05-07T20:32:12.4101068Z ) 2025-05-07T20:32:12.4101291Z self = 2025-05-07T20:32:12.4101457Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.4101462Z 2025-05-07T20:32:12.4101547Z @given( 2025-05-07T20:32:12.4101710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4101815Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4102007Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4102128Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4102248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4102323Z ) 2025-05-07T20:32:12.4102576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4102679Z def test_silu_mul_quant( 2025-05-07T20:32:12.4102757Z self, 2025-05-07T20:32:12.4102833Z T: int, 2025-05-07T20:32:12.4102917Z D: int, 2025-05-07T20:32:12.4103016Z scale_ub: Optional[float], 2025-05-07T20:32:12.4103107Z contiguous: bool, 2025-05-07T20:32:12.4103199Z compiled: bool, 2025-05-07T20:32:12.4103277Z ) -> None: 2025-05-07T20:32:12.4103378Z torch.manual_seed(2025) 2025-05-07T20:32:12.4103458Z 2025-05-07T20:32:12.4103630Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4103709Z 2025-05-07T20:32:12.4103812Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4103938Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4104034Z x = x_sign * x_clamp 2025-05-07T20:32:12.4104113Z x0 = x[:, :D] 2025-05-07T20:32:12.4104199Z x1 = x[:, D:] 2025-05-07T20:32:12.4104277Z 2025-05-07T20:32:12.4104360Z if contiguous: 2025-05-07T20:32:12.4104454Z x0 = x0.contiguous() 2025-05-07T20:32:12.4104554Z x1 = x1.contiguous() 2025-05-07T20:32:12.4104625Z 2025-05-07T20:32:12.4104716Z if scale_ub is not None: 2025-05-07T20:32:12.4104829Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:12.4104966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4105047Z ) 2025-05-07T20:32:12.4105125Z else: 2025-05-07T20:32:12.4105226Z scale_ub_tensor = None 2025-05-07T20:32:12.4105304Z 2025-05-07T20:32:12.4105441Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4105535Z op = silu_mul_quant 2025-05-07T20:32:12.4105624Z if compiled: 2025-05-07T20:32:12.4105724Z op = torch.compile(op) 2025-05-07T20:32:12.4105834Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4105913Z 2025-05-07T20:32:12.4106005Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4106010Z 2025-05-07T20:32:12.4106116Z moe/activation_test.py:117: 2025-05-07T20:32:12.4106244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4106347Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4106453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4106821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4106921Z return fn(*args, **kwargs) 2025-05-07T20:32:12.4107424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4107525Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4107891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4108165Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4108512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4108613Z kernel = self.compile( 2025-05-07T20:32:12.4108995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4109174Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4109306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4109352Z 2025-05-07T20:32:12.4109631Z self = 2025-05-07T20:32:12.4110411Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4110921Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bcfa68c0>} 2025-05-07T20:32:12.4111669Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4111860Z context = 2025-05-07T20:32:12.4111868Z 2025-05-07T20:32:12.4112035Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4112313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4112423Z module_map=module_map) 2025-05-07T20:32:12.4112587Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4112694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4112775Z E ^ 2025-05-07T20:32:12.4113133Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4113138Z 2025-05-07T20:32:12.4113549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4113554Z 2025-05-07T20:32:12.4113660Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4113888Z self=, 2025-05-07T20:32:12.4113969Z T=1, 2025-05-07T20:32:12.4114057Z D=5120, 2025-05-07T20:32:12.4114144Z scale_ub=None, 2025-05-07T20:32:12.4114238Z contiguous=False, 2025-05-07T20:32:12.4114327Z compiled=True, 2025-05-07T20:32:12.4114401Z ) 2025-05-07T20:32:12.4114622Z self = 2025-05-07T20:32:12.4114796Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4114803Z 2025-05-07T20:32:12.4114881Z @given( 2025-05-07T20:32:12.4115002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4115112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4115234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4115361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4115478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4115553Z ) 2025-05-07T20:32:12.4115803Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4115901Z def test_silu_mul_quant( 2025-05-07T20:32:12.4115978Z self, 2025-05-07T20:32:12.4116072Z T: int, 2025-05-07T20:32:12.4116150Z D: int, 2025-05-07T20:32:12.4116252Z scale_ub: Optional[float], 2025-05-07T20:32:12.4116348Z contiguous: bool, 2025-05-07T20:32:12.4116433Z compiled: bool, 2025-05-07T20:32:12.4116566Z ) -> None: 2025-05-07T20:32:12.4116668Z torch.manual_seed(2025) 2025-05-07T20:32:12.4116738Z 2025-05-07T20:32:12.4116918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4116993Z 2025-05-07T20:32:12.4117086Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4117217Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4117307Z x = x_sign * x_clamp 2025-05-07T20:32:12.4117389Z x0 = x[:, :D] 2025-05-07T20:32:12.4117475Z x1 = x[:, D:] 2025-05-07T20:32:12.4117612Z 2025-05-07T20:32:12.4117698Z if contiguous: 2025-05-07T20:32:12.4117799Z x0 = x0.contiguous() 2025-05-07T20:32:12.4117966Z x1 = x1.contiguous() 2025-05-07T20:32:12.4118038Z 2025-05-07T20:32:12.4118139Z if scale_ub is not None: 2025-05-07T20:32:12.4118246Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4118381Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4118468Z ) 2025-05-07T20:32:12.4118544Z else: 2025-05-07T20:32:12.4118649Z scale_ub_tensor = None 2025-05-07T20:32:12.4118721Z 2025-05-07T20:32:12.4118852Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4118949Z op = silu_mul_quant 2025-05-07T20:32:12.4119037Z if compiled: 2025-05-07T20:32:12.4119139Z op = torch.compile(op) 2025-05-07T20:32:12.4119253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4119328Z 2025-05-07T20:32:12.4119425Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.4119557Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.4119631Z 2025-05-07T20:32:12.4119777Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4119889Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.4119991Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.4120121Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.4120266Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4120340Z 2025-05-07T20:32:12.4120452Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:12.4120456Z 2025-05-07T20:32:12.4120558Z moe/activation_test.py:126: 2025-05-07T20:32:12.4120685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4120798Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.4120934Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4121502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.4121606Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.4121963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4122191Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4122558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.4122821Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4123222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.4123474Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4123868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.4124039Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.4124378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.4124459Z fn() 2025-05-07T20:32:12.4124914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.4125004Z self.fn.run( 2025-05-07T20:32:12.4125340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4125436Z kernel = self.compile( 2025-05-07T20:32:12.4125822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4125998Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4126169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4126254Z 2025-05-07T20:32:12.4126470Z self = 2025-05-07T20:32:12.4127244Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4127758Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f07bc8a7880>} 2025-05-07T20:32:12.4128499Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4128703Z context = 2025-05-07T20:32:12.4128708Z 2025-05-07T20:32:12.4128880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4129148Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4129264Z module_map=module_map) 2025-05-07T20:32:12.4129429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4129534Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.4129616Z E ^ 2025-05-07T20:32:12.4129968Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4129973Z 2025-05-07T20:32:12.4130390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4130394Z 2025-05-07T20:32:12.4130500Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4130722Z self=, 2025-05-07T20:32:12.4130809Z T=1, 2025-05-07T20:32:12.4130892Z D=5120, 2025-05-07T20:32:12.4130983Z scale_ub=None, 2025-05-07T20:32:12.4131070Z contiguous=True, 2025-05-07T20:32:12.4131155Z compiled=False, 2025-05-07T20:32:12.4131234Z ) 2025-05-07T20:32:12.4131450Z self = 2025-05-07T20:32:12.4131620Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:12.4131624Z 2025-05-07T20:32:12.4131710Z @given( 2025-05-07T20:32:12.4131832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4131934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4132057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4132177Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4132302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4132379Z ) 2025-05-07T20:32:12.4132628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4132732Z def test_silu_mul_quant( 2025-05-07T20:32:12.4132810Z self, 2025-05-07T20:32:12.4132889Z T: int, 2025-05-07T20:32:12.4132973Z D: int, 2025-05-07T20:32:12.4133074Z scale_ub: Optional[float], 2025-05-07T20:32:12.4133216Z contiguous: bool, 2025-05-07T20:32:12.4133308Z compiled: bool, 2025-05-07T20:32:12.4133384Z ) -> None: 2025-05-07T20:32:12.4133481Z torch.manual_seed(2025) 2025-05-07T20:32:12.4133560Z 2025-05-07T20:32:12.4133729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4133815Z 2025-05-07T20:32:12.4133908Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4134037Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4134138Z x = x_sign * x_clamp 2025-05-07T20:32:12.4134285Z x0 = x[:, :D] 2025-05-07T20:32:12.4134371Z x1 = x[:, D:] 2025-05-07T20:32:12.4134471Z 2025-05-07T20:32:12.4134638Z if contiguous: 2025-05-07T20:32:12.4134740Z x0 = x0.contiguous() 2025-05-07T20:32:12.4134840Z x1 = x1.contiguous() 2025-05-07T20:32:12.4134913Z 2025-05-07T20:32:12.4135005Z if scale_ub is not None: 2025-05-07T20:32:12.4135119Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4135259Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4135339Z ) 2025-05-07T20:32:12.4135422Z else: 2025-05-07T20:32:12.4135519Z scale_ub_tensor = None 2025-05-07T20:32:12.4135599Z 2025-05-07T20:32:12.4135728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4135820Z op = silu_mul_quant 2025-05-07T20:32:12.4135917Z if compiled: 2025-05-07T20:32:12.4136020Z 
op = torch.compile(op) 2025-05-07T20:32:12.4136134Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4136218Z 2025-05-07T20:32:12.4136310Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4136320Z 2025-05-07T20:32:12.4136421Z moe/activation_test.py:117: 2025-05-07T20:32:12.4136559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4136662Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4136772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4137267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4137367Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4137731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4137954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4138292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4138397Z kernel = self.compile( 2025-05-07T20:32:12.4138791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4138974Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4139102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4139109Z 2025-05-07T20:32:12.4139317Z self = 2025-05-07T20:32:12.4140179Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4140683Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc8a6b00>} 2025-05-07T20:32:12.4141450Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4141643Z context = 2025-05-07T20:32:12.4141697Z 2025-05-07T20:32:12.4141869Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4142135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4142244Z module_map=module_map) 2025-05-07T20:32:12.4142431Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4142533Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4142616Z E ^ 2025-05-07T20:32:12.4148654Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4148742Z 2025-05-07T20:32:12.4149283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4149289Z 2025-05-07T20:32:12.4149413Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4149644Z self=, 2025-05-07T20:32:12.4149731Z T=128, 2025-05-07T20:32:12.4149825Z D=5120, 2025-05-07T20:32:12.4149912Z scale_ub=None, 2025-05-07T20:32:12.4150005Z contiguous=False, 2025-05-07T20:32:12.4150101Z compiled=True, 2025-05-07T20:32:12.4150182Z ) 2025-05-07T20:32:12.4150404Z self = 2025-05-07T20:32:12.4150587Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4150592Z 2025-05-07T20:32:12.4150676Z @given( 2025-05-07T20:32:12.4150806Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4150914Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4151040Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4151169Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4151289Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4151368Z ) 2025-05-07T20:32:12.4151628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4151731Z def test_silu_mul_quant( 2025-05-07T20:32:12.4151812Z self, 2025-05-07T20:32:12.4151899Z T: int, 2025-05-07T20:32:12.4151979Z D: int, 2025-05-07T20:32:12.4152084Z scale_ub: Optional[float], 2025-05-07T20:32:12.4152187Z contiguous: bool, 2025-05-07T20:32:12.4152277Z compiled: bool, 2025-05-07T20:32:12.4152368Z ) -> None: 2025-05-07T20:32:12.4152468Z torch.manual_seed(2025) 2025-05-07T20:32:12.4152547Z 2025-05-07T20:32:12.4152733Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4152814Z 2025-05-07T20:32:12.4152913Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4153052Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4153150Z x = x_sign * x_clamp 2025-05-07T20:32:12.4153238Z x0 = x[:, :D] 2025-05-07T20:32:12.4153334Z x1 = x[:, D:] 2025-05-07T20:32:12.4153414Z 2025-05-07T20:32:12.4153507Z if contiguous: 2025-05-07T20:32:12.4153613Z x0 = x0.contiguous() 2025-05-07T20:32:12.4153706Z x1 = x1.contiguous() 2025-05-07T20:32:12.4153797Z 2025-05-07T20:32:12.4153895Z if scale_ub is not None: 2025-05-07T20:32:12.4154006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4154157Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4154238Z ) 2025-05-07T20:32:12.4154321Z else: 2025-05-07T20:32:12.4154432Z scale_ub_tensor = None 2025-05-07T20:32:12.4154514Z 2025-05-07T20:32:12.4154650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4154760Z op = silu_mul_quant 2025-05-07T20:32:12.4154852Z if compiled: 2025-05-07T20:32:12.4154958Z op = torch.compile(op) 2025-05-07T20:32:12.4155080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4155162Z 2025-05-07T20:32:12.4155341Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4155353Z 2025-05-07T20:32:12.4155459Z moe/activation_test.py:117: 2025-05-07T20:32:12.4155593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4155711Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4155816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4156192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4156300Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4156984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4157098Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4157462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4157698Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4158055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4158156Z kernel = self.compile( 2025-05-07T20:32:12.4158545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4158736Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4158869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4158876Z 2025-05-07T20:32:12.4159094Z self = 2025-05-07T20:32:12.4159886Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4160402Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc56b370>} 2025-05-07T20:32:12.4161157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4161352Z context = 2025-05-07T20:32:12.4161356Z 2025-05-07T20:32:12.4161539Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4161812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4161925Z module_map=module_map) 2025-05-07T20:32:12.4162103Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4162209Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4162301Z E ^ 2025-05-07T20:32:12.4162660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4162665Z 2025-05-07T20:32:12.4163091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4163095Z 2025-05-07T20:32:12.4163215Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4163443Z self=, 2025-05-07T20:32:12.4163536Z T=128, 2025-05-07T20:32:12.4163617Z D=7168, 2025-05-07T20:32:12.4163705Z scale_ub=1200.0, 2025-05-07T20:32:12.4163806Z contiguous=False, 2025-05-07T20:32:12.4163904Z compiled=False, 2025-05-07T20:32:12.4163986Z ) 2025-05-07T20:32:12.4164215Z self = 2025-05-07T20:32:12.4164393Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.4164446Z 2025-05-07T20:32:12.4164529Z @given( 2025-05-07T20:32:12.4164661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4164767Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4164896Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4165018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4165137Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4165226Z ) 2025-05-07T20:32:12.4165476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4166296Z def test_silu_mul_quant( 2025-05-07T20:32:12.4166383Z self, 2025-05-07T20:32:12.4166542Z T: int, 2025-05-07T20:32:12.4166624Z D: int, 2025-05-07T20:32:12.4166740Z scale_ub: Optional[float], 2025-05-07T20:32:12.4166837Z contiguous: bool, 2025-05-07T20:32:12.4166928Z compiled: bool, 2025-05-07T20:32:12.4167021Z ) -> None: 2025-05-07T20:32:12.4167125Z torch.manual_seed(2025) 2025-05-07T20:32:12.4167208Z 2025-05-07T20:32:12.4167385Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4167465Z 2025-05-07T20:32:12.4167568Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4167698Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4167793Z x = x_sign * x_clamp 2025-05-07T20:32:12.4167885Z x0 = x[:, :D] 2025-05-07T20:32:12.4167972Z x1 = x[:, D:] 2025-05-07T20:32:12.4168049Z 2025-05-07T20:32:12.4168149Z if contiguous: 2025-05-07T20:32:12.4168246Z x0 = x0.contiguous() 2025-05-07T20:32:12.4168344Z x1 = x1.contiguous() 2025-05-07T20:32:12.4168428Z 2025-05-07T20:32:12.4168524Z if scale_ub is not None: 2025-05-07T20:32:12.4168636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4168783Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4168867Z ) 2025-05-07T20:32:12.4168960Z else: 2025-05-07T20:32:12.4169058Z scale_ub_tensor = None 2025-05-07T20:32:12.4169138Z 2025-05-07T20:32:12.4169281Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4169380Z op = silu_mul_quant 2025-05-07T20:32:12.4169472Z if compiled: 2025-05-07T20:32:12.4169588Z op = torch.compile(op) 2025-05-07T20:32:12.4169696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4169777Z 2025-05-07T20:32:12.4169882Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4169890Z 2025-05-07T20:32:12.4169992Z moe/activation_test.py:117: 2025-05-07T20:32:12.4170133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4170247Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4170351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4170858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4170964Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4171330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4171567Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4171916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4172015Z kernel = self.compile( 2025-05-07T20:32:12.4172412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4172595Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4172732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4172737Z 2025-05-07T20:32:12.4172947Z self = 2025-05-07T20:32:12.4173794Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4174315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc56a560>} 2025-05-07T20:32:12.4175140Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4175381Z context = 2025-05-07T20:32:12.4175385Z 2025-05-07T20:32:12.4175553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4175833Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4175944Z module_map=module_map) 2025-05-07T20:32:12.4176111Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4176222Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4176303Z E ^ 2025-05-07T20:32:12.4176661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4176666Z 2025-05-07T20:32:12.4177090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4177097Z 2025-05-07T20:32:12.4177208Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4177440Z self=, 2025-05-07T20:32:12.4177521Z T=128, 2025-05-07T20:32:12.4177602Z D=5120, 2025-05-07T20:32:12.4177697Z scale_ub=None, 2025-05-07T20:32:12.4177792Z contiguous=False, 2025-05-07T20:32:12.4177883Z compiled=False, 2025-05-07T20:32:12.4177967Z ) 2025-05-07T20:32:12.4178186Z self = 2025-05-07T20:32:12.4178360Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.4178364Z 2025-05-07T20:32:12.4178454Z @given( 2025-05-07T20:32:12.4178578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4178687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4178809Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4178930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4179060Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4179139Z ) 2025-05-07T20:32:12.4179386Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4179496Z def test_silu_mul_quant( 2025-05-07T20:32:12.4179580Z self, 2025-05-07T20:32:12.4179663Z T: int, 2025-05-07T20:32:12.4179749Z D: int, 2025-05-07T20:32:12.4179991Z scale_ub: Optional[float], 2025-05-07T20:32:12.4180093Z contiguous: bool, 2025-05-07T20:32:12.4180184Z compiled: bool, 2025-05-07T20:32:12.4180266Z ) -> None: 2025-05-07T20:32:12.4180370Z torch.manual_seed(2025) 2025-05-07T20:32:12.4180446Z 2025-05-07T20:32:12.4180620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4180704Z 2025-05-07T20:32:12.4180804Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4180932Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4181041Z x = x_sign * x_clamp 2025-05-07T20:32:12.4181128Z x0 = x[:, :D] 2025-05-07T20:32:12.4181213Z x1 = x[:, D:] 2025-05-07T20:32:12.4181296Z 2025-05-07T20:32:12.4181385Z if contiguous: 2025-05-07T20:32:12.4181481Z x0 = x0.contiguous() 2025-05-07T20:32:12.4181637Z x1 = x1.contiguous() 2025-05-07T20:32:12.4181716Z 2025-05-07T20:32:12.4181817Z if scale_ub is not None: 2025-05-07T20:32:12.4181927Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4182069Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4182156Z ) 2025-05-07T20:32:12.4182236Z else: 2025-05-07T20:32:12.4182334Z scale_ub_tensor = None 2025-05-07T20:32:12.4182414Z 2025-05-07T20:32:12.4182549Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4182687Z op = silu_mul_quant 2025-05-07T20:32:12.4182783Z if compiled: 2025-05-07T20:32:12.4182985Z op = torch.compile(op) 2025-05-07T20:32:12.4183099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4183181Z 2025-05-07T20:32:12.4183276Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4183280Z 2025-05-07T20:32:12.4183388Z moe/activation_test.py:117: 2025-05-07T20:32:12.4183522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4183627Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4183739Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4184250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4184352Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4184720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4184951Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4185304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4185403Z kernel = self.compile( 2025-05-07T20:32:12.4185788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4185978Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4186108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4186113Z 2025-05-07T20:32:12.4186330Z self = 2025-05-07T20:32:12.4187107Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4187624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc568550>} 2025-05-07T20:32:12.4188379Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4188578Z context = 2025-05-07T20:32:12.4188582Z 2025-05-07T20:32:12.4188756Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4189020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4189132Z module_map=module_map) 2025-05-07T20:32:12.4189306Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4189414Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4189503Z E ^ 2025-05-07T20:32:12.4190166Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4190174Z 2025-05-07T20:32:12.4190657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4190828Z 2025-05-07T20:32:12.4190946Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4191169Z self=, 2025-05-07T20:32:12.4191247Z T=128, 2025-05-07T20:32:12.4191336Z D=5120, 2025-05-07T20:32:12.4191426Z scale_ub=1200.0, 2025-05-07T20:32:12.4191517Z contiguous=True, 2025-05-07T20:32:12.4191605Z compiled=False, 2025-05-07T20:32:12.4191681Z ) 2025-05-07T20:32:12.4191904Z self = 2025-05-07T20:32:12.4192155Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.4192159Z 2025-05-07T20:32:12.4192364Z @given( 2025-05-07T20:32:12.4192497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4192599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4192718Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4192848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4192964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4193043Z ) 2025-05-07T20:32:12.4193294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4193391Z def test_silu_mul_quant( 2025-05-07T20:32:12.4193480Z self, 2025-05-07T20:32:12.4193561Z T: int, 2025-05-07T20:32:12.4193638Z D: int, 2025-05-07T20:32:12.4193747Z scale_ub: Optional[float], 2025-05-07T20:32:12.4193841Z contiguous: bool, 2025-05-07T20:32:12.4193936Z compiled: bool, 2025-05-07T20:32:12.4194023Z ) -> None: 2025-05-07T20:32:12.4194123Z torch.manual_seed(2025) 2025-05-07T20:32:12.4194195Z 2025-05-07T20:32:12.4194368Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4194443Z 2025-05-07T20:32:12.4194545Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4194669Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4194764Z x = x_sign * x_clamp 2025-05-07T20:32:12.4194853Z x0 = x[:, :D] 2025-05-07T20:32:12.4194936Z x1 = x[:, D:] 2025-05-07T20:32:12.4195010Z 2025-05-07T20:32:12.4195099Z if contiguous: 2025-05-07T20:32:12.4195191Z x0 = x0.contiguous() 2025-05-07T20:32:12.4195282Z x1 = x1.contiguous() 2025-05-07T20:32:12.4195362Z 2025-05-07T20:32:12.4195452Z if scale_ub is not None: 2025-05-07T20:32:12.4195561Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4195705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4195780Z ) 2025-05-07T20:32:12.4195865Z else: 2025-05-07T20:32:12.4195968Z scale_ub_tensor = None 2025-05-07T20:32:12.4196041Z 2025-05-07T20:32:12.4196179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4196272Z op = silu_mul_quant 2025-05-07T20:32:12.4196364Z if compiled: 2025-05-07T20:32:12.4196472Z op = torch.compile(op) 2025-05-07T20:32:12.4196580Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4196653Z 2025-05-07T20:32:12.4196752Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4196757Z 2025-05-07T20:32:12.4196855Z moe/activation_test.py:117: 2025-05-07T20:32:12.4196985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4197093Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4197193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4197702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4197803Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4198168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4198399Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4198793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4198895Z kernel = self.compile( 2025-05-07T20:32:12.4199283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4199457Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4199592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4199637Z 2025-05-07T20:32:12.4199846Z self = 2025-05-07T20:32:12.4200716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4201233Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc568700>} 2025-05-07T20:32:12.4201974Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4202173Z context = 2025-05-07T20:32:12.4202177Z 2025-05-07T20:32:12.4202348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4202627Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4202738Z module_map=module_map) 2025-05-07T20:32:12.4202905Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4203013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4203093Z E ^ 2025-05-07T20:32:12.4203447Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4203451Z 2025-05-07T20:32:12.4203869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4203874Z 2025-05-07T20:32:12.4203979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4204209Z self=, 2025-05-07T20:32:12.4204286Z T=1, 2025-05-07T20:32:12.4204363Z D=7168, 2025-05-07T20:32:12.4204453Z scale_ub=1200.0, 2025-05-07T20:32:12.4204548Z contiguous=True, 2025-05-07T20:32:12.4204633Z compiled=True, 2025-05-07T20:32:12.4204717Z ) 2025-05-07T20:32:12.4204934Z self = 2025-05-07T20:32:12.4205103Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.4205118Z 2025-05-07T20:32:12.4205195Z @given( 2025-05-07T20:32:12.4205319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4205423Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4205544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4205663Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4205787Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4205865Z ) 2025-05-07T20:32:12.4206110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4206218Z def test_silu_mul_quant( 2025-05-07T20:32:12.4206295Z self, 2025-05-07T20:32:12.4206383Z T: int, 2025-05-07T20:32:12.4206466Z D: int, 2025-05-07T20:32:12.4206568Z scale_ub: Optional[float], 2025-05-07T20:32:12.4206666Z contiguous: bool, 2025-05-07T20:32:12.4206756Z compiled: bool, 2025-05-07T20:32:12.4206834Z ) -> None: 2025-05-07T20:32:12.4207007Z torch.manual_seed(2025) 2025-05-07T20:32:12.4207079Z 2025-05-07T20:32:12.4207250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4207327Z 2025-05-07T20:32:12.4207421Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4207548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4207650Z x = x_sign * x_clamp 2025-05-07T20:32:12.4207731Z x0 = x[:, :D] 2025-05-07T20:32:12.4207812Z x1 = x[:, D:] 2025-05-07T20:32:12.4207897Z 2025-05-07T20:32:12.4208031Z if contiguous: 2025-05-07T20:32:12.4208133Z x0 = x0.contiguous() 2025-05-07T20:32:12.4208301Z x1 = x1.contiguous() 2025-05-07T20:32:12.4208376Z 2025-05-07T20:32:12.4208484Z if scale_ub is not None: 2025-05-07T20:32:12.4208591Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4208729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4208819Z ) 2025-05-07T20:32:12.4208896Z else: 2025-05-07T20:32:12.4209007Z scale_ub_tensor = None 2025-05-07T20:32:12.4209079Z 2025-05-07T20:32:12.4209215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4209314Z op = silu_mul_quant 2025-05-07T20:32:12.4209402Z if compiled: 2025-05-07T20:32:12.4209504Z op = torch.compile(op) 2025-05-07T20:32:12.4209619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4209693Z 2025-05-07T20:32:12.4209789Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4209796Z 2025-05-07T20:32:12.4209908Z moe/activation_test.py:117: 2025-05-07T20:32:12.4210044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4210155Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4210261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4210634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4210742Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4211235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4211334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4211698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4211921Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4212271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4212375Z kernel = self.compile( 2025-05-07T20:32:12.4212764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4212952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4213081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4213086Z 2025-05-07T20:32:12.4213301Z self = 2025-05-07T20:32:12.4214069Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4214575Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bcd18280>} 2025-05-07T20:32:12.4215333Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4215524Z context = 2025-05-07T20:32:12.4215576Z 2025-05-07T20:32:12.4215749Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4216013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4216124Z module_map=module_map) 2025-05-07T20:32:12.4216299Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4216400Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4216482Z E ^ 2025-05-07T20:32:12.4216885Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4216988Z 2025-05-07T20:32:12.4217409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4217413Z 2025-05-07T20:32:12.4217525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4217749Z self=, 2025-05-07T20:32:12.4217828Z T=1, 2025-05-07T20:32:12.4217912Z D=7168, 2025-05-07T20:32:12.4217997Z scale_ub=1200.0, 2025-05-07T20:32:12.4218090Z contiguous=False, 2025-05-07T20:32:12.4218176Z compiled=True, 2025-05-07T20:32:12.4218250Z ) 2025-05-07T20:32:12.4218472Z self = 2025-05-07T20:32:12.4218644Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4218653Z 2025-05-07T20:32:12.4218734Z @given( 2025-05-07T20:32:12.4218865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4218972Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4219090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4219217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4219334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4219415Z ) 2025-05-07T20:32:12.4219660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4219756Z def test_silu_mul_quant( 2025-05-07T20:32:12.4219949Z self, 2025-05-07T20:32:12.4220026Z T: int, 2025-05-07T20:32:12.4220103Z D: int, 2025-05-07T20:32:12.4220212Z scale_ub: Optional[float], 2025-05-07T20:32:12.4220301Z contiguous: bool, 2025-05-07T20:32:12.4220390Z compiled: bool, 2025-05-07T20:32:12.4220472Z ) -> None: 2025-05-07T20:32:12.4220566Z torch.manual_seed(2025) 2025-05-07T20:32:12.4220642Z 2025-05-07T20:32:12.4220824Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4220899Z 2025-05-07T20:32:12.4221001Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4221129Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4221221Z x = x_sign * x_clamp 2025-05-07T20:32:12.4221309Z x0 = x[:, :D] 2025-05-07T20:32:12.4221392Z x1 = x[:, D:] 2025-05-07T20:32:12.4221463Z 2025-05-07T20:32:12.4221551Z if contiguous: 2025-05-07T20:32:12.4221645Z x0 = x0.contiguous() 2025-05-07T20:32:12.4221735Z x1 = x1.contiguous() 2025-05-07T20:32:12.4221815Z 2025-05-07T20:32:12.4221907Z if scale_ub is not None: 2025-05-07T20:32:12.4222015Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4222161Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4222238Z ) 2025-05-07T20:32:12.4222325Z else: 2025-05-07T20:32:12.4222423Z scale_ub_tensor = None 2025-05-07T20:32:12.4222498Z 2025-05-07T20:32:12.4222640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4222734Z op = silu_mul_quant 2025-05-07T20:32:12.4222822Z if compiled: 2025-05-07T20:32:12.4222931Z op = torch.compile(op) 2025-05-07T20:32:12.4223042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4223177Z 2025-05-07T20:32:12.4223280Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4223284Z 2025-05-07T20:32:12.4223383Z moe/activation_test.py:117: 2025-05-07T20:32:12.4223512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4223631Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4223736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4224111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4224259Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4224832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4224942Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4225306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4225542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4225892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4225995Z kernel = self.compile( 2025-05-07T20:32:12.4226388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4226565Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4226696Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4226701Z 2025-05-07T20:32:12.4226920Z self = 2025-05-07T20:32:12.4227689Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4228195Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abcf75b0>} 2025-05-07T20:32:12.4228937Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4229138Z context = 2025-05-07T20:32:12.4229145Z 2025-05-07T20:32:12.4229311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4229577Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4229694Z module_map=module_map) 2025-05-07T20:32:12.4229858Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4229959Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4230048Z E ^ 2025-05-07T20:32:12.4230402Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4230407Z 2025-05-07T20:32:12.4230826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4230830Z 2025-05-07T20:32:12.4230938Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4231166Z self=, 2025-05-07T20:32:12.4231254Z T=1, 2025-05-07T20:32:12.4231330Z D=7168, 2025-05-07T20:32:12.4231419Z scale_ub=None, 2025-05-07T20:32:12.4231517Z contiguous=False, 2025-05-07T20:32:12.4231607Z compiled=True, 2025-05-07T20:32:12.4231686Z ) 2025-05-07T20:32:12.4231903Z self = 2025-05-07T20:32:12.4232125Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4232130Z 2025-05-07T20:32:12.4232211Z @given( 2025-05-07T20:32:12.4232331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4232433Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4232556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4232674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4232790Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4232872Z ) 2025-05-07T20:32:12.4233167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4233266Z def test_silu_mul_quant( 2025-05-07T20:32:12.4233424Z self, 2025-05-07T20:32:12.4233503Z T: int, 2025-05-07T20:32:12.4233586Z D: int, 2025-05-07T20:32:12.4233686Z scale_ub: Optional[float], 2025-05-07T20:32:12.4233777Z contiguous: bool, 2025-05-07T20:32:12.4233874Z compiled: bool, 2025-05-07T20:32:12.4233954Z ) -> None: 2025-05-07T20:32:12.4234048Z torch.manual_seed(2025) 2025-05-07T20:32:12.4234127Z 2025-05-07T20:32:12.4234295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4234367Z 2025-05-07T20:32:12.4234467Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4234591Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4234689Z x = x_sign * x_clamp 2025-05-07T20:32:12.4234770Z x0 = x[:, :D] 2025-05-07T20:32:12.4234850Z x1 = x[:, D:] 2025-05-07T20:32:12.4234930Z 2025-05-07T20:32:12.4235016Z if contiguous: 2025-05-07T20:32:12.4235115Z x0 = x0.contiguous() 2025-05-07T20:32:12.4235214Z x1 = x1.contiguous() 2025-05-07T20:32:12.4235285Z 2025-05-07T20:32:12.4235376Z if scale_ub is not None: 2025-05-07T20:32:12.4235491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4235626Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4235704Z ) 2025-05-07T20:32:12.4235790Z else: 2025-05-07T20:32:12.4235885Z scale_ub_tensor = None 2025-05-07T20:32:12.4235963Z 2025-05-07T20:32:12.4236096Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4236185Z op = silu_mul_quant 2025-05-07T20:32:12.4236277Z if compiled: 2025-05-07T20:32:12.4236379Z op = torch.compile(op) 2025-05-07T20:32:12.4236488Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4236569Z 2025-05-07T20:32:12.4236662Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.4236791Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.4236873Z 2025-05-07T20:32:12.4237013Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4237117Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.4237226Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.4237352Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.4237499Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4237575Z 2025-05-07T20:32:12.4237676Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:12.4237681Z 2025-05-07T20:32:12.4237787Z moe/activation_test.py:126: 2025-05-07T20:32:12.4237915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4238022Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.4238165Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4238728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.4238840Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.4239199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4239472Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4239848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.4240103Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4240502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.4240758Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4241260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.4241436Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.4241783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.4241862Z fn() 2025-05-07T20:32:12.4242268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.4242352Z self.fn.run( 2025-05-07T20:32:12.4242695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4242793Z kernel = self.compile( 2025-05-07T20:32:12.4243176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4243367Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4243500Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4243504Z 2025-05-07T20:32:12.4243710Z self = 2025-05-07T20:32:12.4244487Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4244987Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f06abcf5bd0>} 2025-05-07T20:32:12.4245734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4245931Z context = 2025-05-07T20:32:12.4245936Z 2025-05-07T20:32:12.4246112Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4246380Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4246487Z module_map=module_map) 2025-05-07T20:32:12.4246661Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4246763Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.4246841Z E ^ 2025-05-07T20:32:12.4247200Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4247205Z 2025-05-07T20:32:12.4247615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4247619Z 2025-05-07T20:32:12.4247732Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4247953Z self=, 2025-05-07T20:32:12.4248032Z T=1, 2025-05-07T20:32:12.4248115Z D=5120, 2025-05-07T20:32:12.4248200Z scale_ub=1200.0, 2025-05-07T20:32:12.4248289Z contiguous=False, 2025-05-07T20:32:12.4248378Z compiled=True, 2025-05-07T20:32:12.4248451Z ) 2025-05-07T20:32:12.4248725Z self = 2025-05-07T20:32:12.4248896Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4248901Z 2025-05-07T20:32:12.4248981Z @given( 2025-05-07T20:32:12.4249107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4249212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4249330Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4249458Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4249649Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4249722Z ) 2025-05-07T20:32:12.4250050Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4250146Z def test_silu_mul_quant( 2025-05-07T20:32:12.4250231Z self, 2025-05-07T20:32:12.4250310Z T: int, 2025-05-07T20:32:12.4250389Z D: int, 2025-05-07T20:32:12.4250502Z scale_ub: Optional[float], 2025-05-07T20:32:12.4250595Z contiguous: bool, 2025-05-07T20:32:12.4250680Z compiled: bool, 2025-05-07T20:32:12.4250764Z ) -> None: 2025-05-07T20:32:12.4250861Z torch.manual_seed(2025) 2025-05-07T20:32:12.4250934Z 2025-05-07T20:32:12.4251111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4251184Z 2025-05-07T20:32:12.4251277Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4251410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4251504Z x = x_sign * x_clamp 2025-05-07T20:32:12.4251595Z x0 = x[:, :D] 2025-05-07T20:32:12.4251674Z x1 = x[:, D:] 2025-05-07T20:32:12.4251751Z 2025-05-07T20:32:12.4251842Z if contiguous: 2025-05-07T20:32:12.4251936Z x0 = x0.contiguous() 2025-05-07T20:32:12.4252028Z x1 = x1.contiguous() 2025-05-07T20:32:12.4252105Z 2025-05-07T20:32:12.4252197Z if scale_ub is not None: 2025-05-07T20:32:12.4252309Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4252451Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4252526Z ) 2025-05-07T20:32:12.4252603Z else: 2025-05-07T20:32:12.4252705Z scale_ub_tensor = None 2025-05-07T20:32:12.4252777Z 2025-05-07T20:32:12.4252909Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4253009Z op = silu_mul_quant 2025-05-07T20:32:12.4253094Z if compiled: 
2025-05-07T20:32:12.4253202Z op = torch.compile(op) 2025-05-07T20:32:12.4253314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4253386Z 2025-05-07T20:32:12.4253491Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4253495Z 2025-05-07T20:32:12.4253600Z moe/activation_test.py:117: 2025-05-07T20:32:12.4253731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4253843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4253947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4254344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4254462Z return fn(*args, **kwargs) 2025-05-07T20:32:12.4254964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4255072Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4255434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4255662Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4256010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4256107Z kernel = self.compile( 2025-05-07T20:32:12.4256494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4256728Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4256856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4256861Z 2025-05-07T20:32:12.4257073Z self = 2025-05-07T20:32:12.4257844Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4258475Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abcf43a0>} 2025-05-07T20:32:12.4259231Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4259428Z context = 2025-05-07T20:32:12.4259439Z 2025-05-07T20:32:12.4259607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4259989Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4260109Z module_map=module_map) 2025-05-07T20:32:12.4260278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4260379Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4260470Z E ^ 2025-05-07T20:32:12.4260826Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4260830Z 2025-05-07T20:32:12.4261246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4261253Z 2025-05-07T20:32:12.4261359Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4261583Z self=, 2025-05-07T20:32:12.4261667Z T=1, 2025-05-07T20:32:12.4261746Z D=5120, 2025-05-07T20:32:12.4261831Z scale_ub=1200.0, 2025-05-07T20:32:12.4261927Z contiguous=False, 2025-05-07T20:32:12.4262009Z compiled=False, 2025-05-07T20:32:12.4262083Z ) 2025-05-07T20:32:12.4262307Z self = 2025-05-07T20:32:12.4262480Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.4262490Z 2025-05-07T20:32:12.4262570Z @given( 2025-05-07T20:32:12.4262690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4262791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4262918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4263046Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4263162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4263241Z ) 2025-05-07T20:32:12.4263487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4263582Z def test_silu_mul_quant( 2025-05-07T20:32:12.4263665Z self, 2025-05-07T20:32:12.4263741Z T: int, 2025-05-07T20:32:12.4263823Z D: int, 2025-05-07T20:32:12.4263925Z scale_ub: Optional[float], 2025-05-07T20:32:12.4264018Z contiguous: bool, 2025-05-07T20:32:12.4264109Z compiled: bool, 2025-05-07T20:32:12.4264187Z ) -> None: 2025-05-07T20:32:12.4264287Z torch.manual_seed(2025) 2025-05-07T20:32:12.4264366Z 2025-05-07T20:32:12.4264535Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4264606Z 2025-05-07T20:32:12.4264705Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4264891Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4264982Z x = x_sign * x_clamp 2025-05-07T20:32:12.4265069Z x0 = x[:, :D] 2025-05-07T20:32:12.4265151Z x1 = x[:, D:] 2025-05-07T20:32:12.4265228Z 2025-05-07T20:32:12.4265312Z if contiguous: 2025-05-07T20:32:12.4265407Z x0 = x0.contiguous() 2025-05-07T20:32:12.4265505Z x1 = x1.contiguous() 2025-05-07T20:32:12.4265577Z 2025-05-07T20:32:12.4265670Z if scale_ub is not None: 2025-05-07T20:32:12.4265782Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4265964Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4266116Z ) 2025-05-07T20:32:12.4266199Z else: 2025-05-07T20:32:12.4266294Z scale_ub_tensor = None 2025-05-07T20:32:12.4266366Z 2025-05-07T20:32:12.4266506Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4266599Z op = silu_mul_quant 2025-05-07T20:32:12.4266687Z if compiled: 2025-05-07T20:32:12.4266795Z op = torch.compile(op) 2025-05-07T20:32:12.4266903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4266979Z 2025-05-07T20:32:12.4267073Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4267077Z 2025-05-07T20:32:12.4267178Z moe/activation_test.py:117: 2025-05-07T20:32:12.4267313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4267417Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4267521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4268028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4268129Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4268498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4268722Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4269060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4269163Z kernel = self.compile( 2025-05-07T20:32:12.4269544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4269719Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4269851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4269858Z 2025-05-07T20:32:12.4270069Z self = 2025-05-07T20:32:12.4270861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4271360Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abcf4ee0>} 2025-05-07T20:32:12.4272118Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4272309Z context = 2025-05-07T20:32:12.4272316Z 2025-05-07T20:32:12.4272482Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4272770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4272878Z module_map=module_map) 2025-05-07T20:32:12.4273053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4278804Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4278907Z E ^ 2025-05-07T20:32:12.4279287Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4279293Z 2025-05-07T20:32:12.4279713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4279718Z 2025-05-07T20:32:12.4279826Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4280057Z self=, 2025-05-07T20:32:12.4280224Z T=16384, 2025-05-07T20:32:12.4280302Z D=5120, 2025-05-07T20:32:12.4280397Z scale_ub=1200.0, 2025-05-07T20:32:12.4280569Z contiguous=False, 2025-05-07T20:32:12.4280663Z compiled=True, 2025-05-07T20:32:12.4280741Z ) 2025-05-07T20:32:12.4280960Z self = 2025-05-07T20:32:12.4281147Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4281155Z 2025-05-07T20:32:12.4281236Z @given( 2025-05-07T20:32:12.4281361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4281472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4281589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4281708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4281832Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4281911Z ) 2025-05-07T20:32:12.4282168Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4282263Z def test_silu_mul_quant( 2025-05-07T20:32:12.4282343Z self, 2025-05-07T20:32:12.4282429Z T: int, 2025-05-07T20:32:12.4282505Z D: int, 2025-05-07T20:32:12.4282608Z scale_ub: Optional[float], 2025-05-07T20:32:12.4282709Z contiguous: bool, 2025-05-07T20:32:12.4282799Z compiled: bool, 2025-05-07T20:32:12.4282885Z ) -> None: 2025-05-07T20:32:12.4282989Z torch.manual_seed(2025) 2025-05-07T20:32:12.4283063Z 2025-05-07T20:32:12.4283235Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4283315Z 2025-05-07T20:32:12.4283410Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4283547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4283642Z x = x_sign * x_clamp 2025-05-07T20:32:12.4283726Z x0 = x[:, :D] 2025-05-07T20:32:12.4283817Z x1 = x[:, D:] 2025-05-07T20:32:12.4283896Z 2025-05-07T20:32:12.4283984Z if contiguous: 2025-05-07T20:32:12.4284088Z x0 = x0.contiguous() 2025-05-07T20:32:12.4284187Z x1 = x1.contiguous() 2025-05-07T20:32:12.4284262Z 2025-05-07T20:32:12.4284371Z if scale_ub is not None: 2025-05-07T20:32:12.4284482Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4284620Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4284705Z ) 2025-05-07T20:32:12.4284787Z else: 2025-05-07T20:32:12.4284896Z scale_ub_tensor = None 2025-05-07T20:32:12.4284969Z 2025-05-07T20:32:12.4285102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4285204Z op = silu_mul_quant 2025-05-07T20:32:12.4285292Z if compiled: 2025-05-07T20:32:12.4285398Z op = torch.compile(op) 2025-05-07T20:32:12.4285520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4285600Z 2025-05-07T20:32:12.4285694Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4285699Z 2025-05-07T20:32:12.4285809Z moe/activation_test.py:117: 2025-05-07T20:32:12.4285947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4286060Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4286164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4286539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4286741Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4287237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4287340Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4287710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4287936Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4288464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4288563Z kernel = self.compile( 2025-05-07T20:32:12.4288955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4289144Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4289275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4289280Z 2025-05-07T20:32:12.4289490Z self = 2025-05-07T20:32:12.4290722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4291242Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abcf69e0>} 2025-05-07T20:32:12.4291996Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4292195Z context = 2025-05-07T20:32:12.4292200Z 2025-05-07T20:32:12.4292375Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4292640Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4292750Z module_map=module_map) 2025-05-07T20:32:12.4292924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4293029Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4293109Z E ^ 2025-05-07T20:32:12.4293474Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4293479Z 2025-05-07T20:32:12.4293898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4293903Z 2025-05-07T20:32:12.4294019Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4294246Z self=, 2025-05-07T20:32:12.4294325Z T=2048, 2025-05-07T20:32:12.4294411Z D=7168, 2025-05-07T20:32:12.4294496Z scale_ub=1200.0, 2025-05-07T20:32:12.4294584Z contiguous=False, 2025-05-07T20:32:12.4294677Z compiled=True, 2025-05-07T20:32:12.4294752Z ) 2025-05-07T20:32:12.4294978Z self = 2025-05-07T20:32:12.4295156Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4295163Z 2025-05-07T20:32:12.4295241Z @given( 2025-05-07T20:32:12.4295375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4295478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4295598Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4295727Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4296012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4296085Z ) 2025-05-07T20:32:12.4296338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4296435Z def test_silu_mul_quant( 2025-05-07T20:32:12.4296518Z self, 2025-05-07T20:32:12.4296596Z T: int, 2025-05-07T20:32:12.4296672Z D: int, 2025-05-07T20:32:12.4296785Z scale_ub: Optional[float], 2025-05-07T20:32:12.4296878Z contiguous: bool, 2025-05-07T20:32:12.4296966Z compiled: bool, 2025-05-07T20:32:12.4297208Z ) -> None: 2025-05-07T20:32:12.4297304Z torch.manual_seed(2025) 2025-05-07T20:32:12.4297377Z 2025-05-07T20:32:12.4297672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4297751Z 2025-05-07T20:32:12.4297846Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4297980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4298072Z x = x_sign * x_clamp 2025-05-07T20:32:12.4298170Z x0 = x[:, :D] 2025-05-07T20:32:12.4298252Z x1 = x[:, D:] 2025-05-07T20:32:12.4298324Z 2025-05-07T20:32:12.4298417Z if contiguous: 2025-05-07T20:32:12.4298512Z x0 = x0.contiguous() 2025-05-07T20:32:12.4298604Z x1 = x1.contiguous() 2025-05-07T20:32:12.4298690Z 2025-05-07T20:32:12.4298784Z if scale_ub is not None: 2025-05-07T20:32:12.4298892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4299039Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4299119Z ) 2025-05-07T20:32:12.4299198Z else: 2025-05-07T20:32:12.4299306Z scale_ub_tensor = None 2025-05-07T20:32:12.4299384Z 2025-05-07T20:32:12.4299527Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4299622Z op = silu_mul_quant 2025-05-07T20:32:12.4299709Z if compiled: 2025-05-07T20:32:12.4299904Z op = torch.compile(op) 2025-05-07T20:32:12.4300022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4300096Z 2025-05-07T20:32:12.4300196Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4300200Z 2025-05-07T20:32:12.4300298Z moe/activation_test.py:117: 2025-05-07T20:32:12.4300426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4300534Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4300634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4301001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4301107Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4301616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4301730Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4302094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4302323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4302672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4302767Z kernel = self.compile( 2025-05-07T20:32:12.4303156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4303345Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4303475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4303484Z 2025-05-07T20:32:12.4303703Z self = 2025-05-07T20:32:12.4304483Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4305047Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abcf7b50>} 2025-05-07T20:32:12.4305811Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4306005Z context = 2025-05-07T20:32:12.4306054Z 2025-05-07T20:32:12.4306303Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4306573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4306690Z module_map=module_map) 2025-05-07T20:32:12.4306855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4306958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4307048Z E ^ 2025-05-07T20:32:12.4307413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4307418Z 2025-05-07T20:32:12.4307831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4307836Z 2025-05-07T20:32:12.4307958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4308184Z self=, 2025-05-07T20:32:12.4308268Z T=1, 2025-05-07T20:32:12.4308356Z D=5120, 2025-05-07T20:32:12.4308442Z scale_ub=None, 2025-05-07T20:32:12.4308542Z contiguous=False, 2025-05-07T20:32:12.4308629Z compiled=False, 2025-05-07T20:32:12.4308701Z ) 2025-05-07T20:32:12.4308924Z self = 2025-05-07T20:32:12.4309097Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.4309101Z 2025-05-07T20:32:12.4309182Z @given( 2025-05-07T20:32:12.4309311Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4309412Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4309535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4309658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4309776Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4309856Z ) 2025-05-07T20:32:12.4310107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4310207Z def test_silu_mul_quant( 2025-05-07T20:32:12.4310289Z self, 2025-05-07T20:32:12.4310367Z T: int, 2025-05-07T20:32:12.4310446Z D: int, 2025-05-07T20:32:12.4310556Z scale_ub: Optional[float], 2025-05-07T20:32:12.4310651Z contiguous: bool, 2025-05-07T20:32:12.4310737Z compiled: bool, 2025-05-07T20:32:12.4310823Z ) -> None: 2025-05-07T20:32:12.4310923Z torch.manual_seed(2025) 2025-05-07T20:32:12.4311002Z 2025-05-07T20:32:12.4311170Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4311245Z 2025-05-07T20:32:12.4311348Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4311474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4311566Z x = x_sign * x_clamp 2025-05-07T20:32:12.4311654Z x0 = x[:, :D] 2025-05-07T20:32:12.4311739Z x1 = x[:, D:] 2025-05-07T20:32:12.4311812Z 2025-05-07T20:32:12.4311902Z if contiguous: 2025-05-07T20:32:12.4312002Z x0 = x0.contiguous() 2025-05-07T20:32:12.4312092Z x1 = x1.contiguous() 2025-05-07T20:32:12.4312177Z 2025-05-07T20:32:12.4312270Z if scale_ub is not None: 2025-05-07T20:32:12.4312384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4312573Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4312652Z ) 2025-05-07T20:32:12.4312735Z else: 2025-05-07T20:32:12.4312832Z scale_ub_tensor = None 2025-05-07T20:32:12.4312903Z 2025-05-07T20:32:12.4313039Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4313131Z op = silu_mul_quant 2025-05-07T20:32:12.4313218Z if compiled: 2025-05-07T20:32:12.4313326Z op = torch.compile(op) 2025-05-07T20:32:12.4313435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4313554Z 2025-05-07T20:32:12.4313655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4313771Z 2025-05-07T20:32:12.4313873Z moe/activation_test.py:117: 2025-05-07T20:32:12.4314011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4314116Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4314217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4314733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4314835Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4315192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4315424Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4315766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4315871Z kernel = self.compile( 2025-05-07T20:32:12.4316260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4316437Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4316572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4316579Z 2025-05-07T20:32:12.4316787Z self = 2025-05-07T20:32:12.4317584Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4318084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe945e0>} 2025-05-07T20:32:12.4318847Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4319048Z context = 2025-05-07T20:32:12.4319053Z 2025-05-07T20:32:12.4319227Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4319496Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4319605Z module_map=module_map) 2025-05-07T20:32:12.4319770Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4319878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4319957Z E ^ 2025-05-07T20:32:12.4320327Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4320335Z 2025-05-07T20:32:12.4320754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4320758Z 2025-05-07T20:32:12.4320865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4321094Z self=, 2025-05-07T20:32:12.4321220Z T=4096, 2025-05-07T20:32:12.4321300Z D=7168, 2025-05-07T20:32:12.4321392Z scale_ub=1200.0, 2025-05-07T20:32:12.4321478Z contiguous=False, 2025-05-07T20:32:12.4321572Z compiled=False, 2025-05-07T20:32:12.4321644Z ) 2025-05-07T20:32:12.4321862Z self = 2025-05-07T20:32:12.4322045Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.4322050Z 2025-05-07T20:32:12.4322131Z @given( 2025-05-07T20:32:12.4322252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4322405Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4322603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4322723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4322846Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4322925Z ) 2025-05-07T20:32:12.4323181Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4323280Z def test_silu_mul_quant( 2025-05-07T20:32:12.4323357Z self, 2025-05-07T20:32:12.4323443Z T: int, 2025-05-07T20:32:12.4323520Z D: int, 2025-05-07T20:32:12.4323626Z scale_ub: Optional[float], 2025-05-07T20:32:12.4323723Z contiguous: bool, 2025-05-07T20:32:12.4323809Z compiled: bool, 2025-05-07T20:32:12.4323893Z ) -> None: 2025-05-07T20:32:12.4323997Z torch.manual_seed(2025) 2025-05-07T20:32:12.4324086Z 2025-05-07T20:32:12.4324286Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4324377Z 2025-05-07T20:32:12.4324477Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4324609Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4324701Z x = x_sign * x_clamp 2025-05-07T20:32:12.4324784Z x0 = x[:, :D] 2025-05-07T20:32:12.4324873Z x1 = x[:, D:] 2025-05-07T20:32:12.4324950Z 2025-05-07T20:32:12.4325034Z if contiguous: 2025-05-07T20:32:12.4325134Z x0 = x0.contiguous() 2025-05-07T20:32:12.4325225Z x1 = x1.contiguous() 2025-05-07T20:32:12.4325298Z 2025-05-07T20:32:12.4325397Z if scale_ub is not None: 2025-05-07T20:32:12.4325504Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4325638Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4325725Z ) 2025-05-07T20:32:12.4325800Z else: 2025-05-07T20:32:12.4325899Z scale_ub_tensor = None 2025-05-07T20:32:12.4325983Z 2025-05-07T20:32:12.4326117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4326218Z op = silu_mul_quant 2025-05-07T20:32:12.4326307Z if compiled: 2025-05-07T20:32:12.4326410Z op = torch.compile(op) 2025-05-07T20:32:12.4326526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4326601Z 2025-05-07T20:32:12.4326696Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4326701Z 2025-05-07T20:32:12.4326805Z moe/activation_test.py:117: 2025-05-07T20:32:12.4326935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4327038Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4327147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4327652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:12.4327759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4328125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4328347Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4328699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4328843Z kernel = self.compile( 2025-05-07T20:32:12.4329238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4329414Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4329540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4329544Z 2025-05-07T20:32:12.4329756Z self = 2025-05-07T20:32:12.4330620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4331173Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe94ca0>} 2025-05-07T20:32:12.4331917Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4332108Z context = 2025-05-07T20:32:12.4332113Z 2025-05-07T20:32:12.4332288Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4332551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4332668Z module_map=module_map) 2025-05-07T20:32:12.4332835Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4332943Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4333029Z E ^ 2025-05-07T20:32:12.4333389Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4333397Z 2025-05-07T20:32:12.4333809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4333822Z 2025-05-07T20:32:12.4333929Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4334150Z self=, 2025-05-07T20:32:12.4334234Z T=16384, 2025-05-07T20:32:12.4334312Z D=7168, 2025-05-07T20:32:12.4334396Z scale_ub=None, 2025-05-07T20:32:12.4334490Z contiguous=True, 2025-05-07T20:32:12.4334573Z compiled=True, 2025-05-07T20:32:12.4334651Z ) 2025-05-07T20:32:12.4334876Z self = 2025-05-07T20:32:12.4335058Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.4335062Z 2025-05-07T20:32:12.4335145Z @given( 2025-05-07T20:32:12.4335267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4335369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4335495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4335620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4335736Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4335815Z ) 2025-05-07T20:32:12.4336067Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4336163Z def test_silu_mul_quant( 2025-05-07T20:32:12.4336249Z self, 2025-05-07T20:32:12.4336325Z T: int, 2025-05-07T20:32:12.4336402Z D: int, 2025-05-07T20:32:12.4336511Z scale_ub: Optional[float], 2025-05-07T20:32:12.4336605Z contiguous: bool, 2025-05-07T20:32:12.4336705Z compiled: bool, 2025-05-07T20:32:12.4336784Z ) -> None: 2025-05-07T20:32:12.4336879Z torch.manual_seed(2025) 2025-05-07T20:32:12.4336960Z 2025-05-07T20:32:12.4337127Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4337252Z 2025-05-07T20:32:12.4337354Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4337482Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4337576Z x = x_sign * x_clamp 2025-05-07T20:32:12.4337662Z x0 = x[:, :D] 2025-05-07T20:32:12.4337746Z x1 = x[:, D:] 2025-05-07T20:32:12.4337819Z 2025-05-07T20:32:12.4337909Z if contiguous: 2025-05-07T20:32:12.4338002Z x0 = x0.contiguous() 2025-05-07T20:32:12.4338103Z x1 = x1.contiguous() 2025-05-07T20:32:12.4338177Z 2025-05-07T20:32:12.4338315Z if scale_ub is not None: 2025-05-07T20:32:12.4338449Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4338669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4338755Z ) 2025-05-07T20:32:12.4338834Z else: 2025-05-07T20:32:12.4338932Z scale_ub_tensor = None 2025-05-07T20:32:12.4339011Z 2025-05-07T20:32:12.4339146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4339239Z op = silu_mul_quant 2025-05-07T20:32:12.4339334Z if compiled: 2025-05-07T20:32:12.4339440Z op = torch.compile(op) 2025-05-07T20:32:12.4339550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4339633Z 2025-05-07T20:32:12.4339729Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4339733Z 2025-05-07T20:32:12.4339972Z moe/activation_test.py:117: 2025-05-07T20:32:12.4340109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4340214Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4340320Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4340694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4340790Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4341290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4341397Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4341756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4341990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4342333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4342441Z kernel = self.compile( 2025-05-07T20:32:12.4342833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4343014Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4343149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4343153Z 2025-05-07T20:32:12.4343361Z self = 2025-05-07T20:32:12.4344144Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4344653Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe95b40>} 2025-05-07T20:32:12.4345400Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4345603Z context = 2025-05-07T20:32:12.4345608Z 2025-05-07T20:32:12.4345774Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4346047Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4346270Z module_map=module_map) 2025-05-07T20:32:12.4346436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4346551Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4346628Z E ^ 2025-05-07T20:32:12.4346985Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4346990Z 2025-05-07T20:32:12.4347402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4347447Z 2025-05-07T20:32:12.4347660Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4347888Z self=, 2025-05-07T20:32:12.4347968Z T=4096, 2025-05-07T20:32:12.4348044Z D=5120, 2025-05-07T20:32:12.4348133Z scale_ub=None, 2025-05-07T20:32:12.4348227Z contiguous=False, 2025-05-07T20:32:12.4348318Z compiled=True, 2025-05-07T20:32:12.4348391Z ) 2025-05-07T20:32:12.4348607Z self = 2025-05-07T20:32:12.4348787Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4348791Z 2025-05-07T20:32:12.4348866Z @given( 2025-05-07T20:32:12.4348986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4349095Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4349215Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4349335Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4349462Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4349539Z ) 2025-05-07T20:32:12.4349789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4349885Z def test_silu_mul_quant( 2025-05-07T20:32:12.4349963Z self, 2025-05-07T20:32:12.4350049Z T: int, 2025-05-07T20:32:12.4350125Z D: int, 2025-05-07T20:32:12.4350225Z scale_ub: Optional[float], 2025-05-07T20:32:12.4350325Z contiguous: bool, 2025-05-07T20:32:12.4350411Z compiled: bool, 2025-05-07T20:32:12.4350491Z ) -> None: 2025-05-07T20:32:12.4350595Z torch.manual_seed(2025) 2025-05-07T20:32:12.4350668Z 2025-05-07T20:32:12.4350838Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4350920Z 2025-05-07T20:32:12.4351016Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4351148Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4351243Z x = x_sign * x_clamp 2025-05-07T20:32:12.4351326Z x0 = x[:, :D] 2025-05-07T20:32:12.4351414Z x1 = x[:, D:] 2025-05-07T20:32:12.4351486Z 2025-05-07T20:32:12.4351571Z if contiguous: 2025-05-07T20:32:12.4351670Z x0 = x0.contiguous() 2025-05-07T20:32:12.4351766Z x1 = x1.contiguous() 2025-05-07T20:32:12.4351838Z 2025-05-07T20:32:12.4351938Z if scale_ub is not None: 2025-05-07T20:32:12.4352044Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4352181Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4352267Z ) 2025-05-07T20:32:12.4352343Z else: 2025-05-07T20:32:12.4352446Z scale_ub_tensor = None 2025-05-07T20:32:12.4352518Z 2025-05-07T20:32:12.4352647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4352747Z op = silu_mul_quant 2025-05-07T20:32:12.4352836Z if compiled: 2025-05-07T20:32:12.4352943Z op = torch.compile(op) 2025-05-07T20:32:12.4353057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4353132Z 2025-05-07T20:32:12.4353226Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4353230Z 2025-05-07T20:32:12.4353338Z moe/activation_test.py:117: 2025-05-07T20:32:12.4353526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4353632Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4353740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4354133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4354246Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4354751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4354895Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4355338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4355568Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4355921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4356021Z kernel = self.compile( 2025-05-07T20:32:12.4356408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4356593Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4356720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4356725Z 2025-05-07T20:32:12.4356931Z self = 2025-05-07T20:32:12.4357716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4358223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe95240>} 2025-05-07T20:32:12.4358975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4359166Z context = 2025-05-07T20:32:12.4359171Z 2025-05-07T20:32:12.4359343Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4359607Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4359720Z module_map=module_map) 2025-05-07T20:32:12.4359896Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4360001Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4360078Z E ^ 2025-05-07T20:32:12.4360439Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4360446Z 2025-05-07T20:32:12.4360858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4360862Z 2025-05-07T20:32:12.4360976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4361199Z self=, 2025-05-07T20:32:12.4361278Z T=4096, 2025-05-07T20:32:12.4361361Z D=5120, 2025-05-07T20:32:12.4361447Z scale_ub=1200.0, 2025-05-07T20:32:12.4361534Z contiguous=False, 2025-05-07T20:32:12.4361630Z compiled=False, 2025-05-07T20:32:12.4361707Z ) 2025-05-07T20:32:12.4361934Z self = 2025-05-07T20:32:12.4362114Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.4362118Z 2025-05-07T20:32:12.4362198Z @given( 2025-05-07T20:32:12.4362323Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4362473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4362592Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4362717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4362833Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4362905Z ) 2025-05-07T20:32:12.4363156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4363251Z def test_silu_mul_quant( 2025-05-07T20:32:12.4363331Z self, 2025-05-07T20:32:12.4363448Z T: int, 2025-05-07T20:32:12.4363522Z D: int, 2025-05-07T20:32:12.4363630Z scale_ub: Optional[float], 2025-05-07T20:32:12.4363797Z contiguous: bool, 2025-05-07T20:32:12.4363886Z compiled: bool, 2025-05-07T20:32:12.4363972Z ) -> None: 2025-05-07T20:32:12.4364067Z torch.manual_seed(2025) 2025-05-07T20:32:12.4364139Z 2025-05-07T20:32:12.4364317Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4364394Z 2025-05-07T20:32:12.4364488Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4364619Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4364710Z x = x_sign * x_clamp 2025-05-07T20:32:12.4364796Z x0 = x[:, :D] 2025-05-07T20:32:12.4364878Z x1 = x[:, D:] 2025-05-07T20:32:12.4364953Z 2025-05-07T20:32:12.4365044Z if contiguous: 2025-05-07T20:32:12.4365138Z x0 = x0.contiguous() 2025-05-07T20:32:12.4365230Z x1 = x1.contiguous() 2025-05-07T20:32:12.4365311Z 2025-05-07T20:32:12.4365405Z if scale_ub is not None: 2025-05-07T20:32:12.4365516Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4365657Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4365734Z ) 2025-05-07T20:32:12.4365812Z else: 2025-05-07T20:32:12.4365916Z scale_ub_tensor = None 2025-05-07T20:32:12.4365992Z 2025-05-07T20:32:12.4366125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4366224Z op = silu_mul_quant 2025-05-07T20:32:12.4366311Z if compiled: 2025-05-07T20:32:12.4366420Z op = torch.compile(op) 2025-05-07T20:32:12.4366529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4366600Z 2025-05-07T20:32:12.4366698Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4366702Z 2025-05-07T20:32:12.4366806Z moe/activation_test.py:117: 2025-05-07T20:32:12.4366934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4367047Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4367156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4367665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:12.4367766Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4368132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4368359Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4368703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4368801Z kernel = self.compile( 2025-05-07T20:32:12.4369190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4369368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4369505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4369509Z 2025-05-07T20:32:12.4369717Z self = 2025-05-07T20:32:12.4370488Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4371067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe96cb0>} 2025-05-07T20:32:12.4371809Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4372049Z context = 2025-05-07T20:32:12.4372132Z 2025-05-07T20:32:12.4372306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4372572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4372682Z module_map=module_map) 2025-05-07T20:32:12.4372847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4372956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4373034Z E ^ 2025-05-07T20:32:12.4373386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4373391Z 2025-05-07T20:32:12.4373808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4373819Z 2025-05-07T20:32:12.4373927Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4374159Z self=, 2025-05-07T20:32:12.4374240Z T=4096, 2025-05-07T20:32:12.4374317Z D=5120, 2025-05-07T20:32:12.4374408Z scale_ub=1200.0, 2025-05-07T20:32:12.4374495Z contiguous=False, 2025-05-07T20:32:12.4374580Z compiled=True, 2025-05-07T20:32:12.4374661Z ) 2025-05-07T20:32:12.4374877Z self = 2025-05-07T20:32:12.4375052Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4375056Z 2025-05-07T20:32:12.4375139Z @given( 2025-05-07T20:32:12.4375259Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4375369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4375486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4375606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4375731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4375808Z ) 2025-05-07T20:32:12.4376058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4376161Z def test_silu_mul_quant( 2025-05-07T20:32:12.4376236Z self, 2025-05-07T20:32:12.4376314Z T: int, 2025-05-07T20:32:12.4376401Z D: int, 2025-05-07T20:32:12.4376505Z scale_ub: Optional[float], 2025-05-07T20:32:12.4376604Z contiguous: bool, 2025-05-07T20:32:12.4376691Z compiled: bool, 2025-05-07T20:32:12.4376769Z ) -> None: 2025-05-07T20:32:12.4376870Z torch.manual_seed(2025) 2025-05-07T20:32:12.4376942Z 2025-05-07T20:32:12.4377113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4377192Z 2025-05-07T20:32:12.4377286Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4377412Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4377513Z x = x_sign * x_clamp 2025-05-07T20:32:12.4377594Z x0 = x[:, :D] 2025-05-07T20:32:12.4377676Z x1 = x[:, D:] 2025-05-07T20:32:12.4377761Z 2025-05-07T20:32:12.4377845Z if contiguous: 2025-05-07T20:32:12.4377939Z x0 = x0.contiguous() 2025-05-07T20:32:12.4378041Z x1 = x1.contiguous() 2025-05-07T20:32:12.4378114Z 2025-05-07T20:32:12.4378212Z if scale_ub is not None: 2025-05-07T20:32:12.4378376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4378512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4378592Z ) 2025-05-07T20:32:12.4378672Z else: 2025-05-07T20:32:12.4378770Z scale_ub_tensor = None 2025-05-07T20:32:12.4378850Z 2025-05-07T20:32:12.4378985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4379076Z op = silu_mul_quant 2025-05-07T20:32:12.4379172Z if compiled: 2025-05-07T20:32:12.4379352Z op = torch.compile(op) 2025-05-07T20:32:12.4379460Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4379618Z 2025-05-07T20:32:12.4379715Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4379719Z 2025-05-07T20:32:12.4379960Z moe/activation_test.py:117: 2025-05-07T20:32:12.4380093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4380200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4380304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4380674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4380772Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4381269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4381368Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4381729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4381963Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4382307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4382410Z kernel = self.compile( 2025-05-07T20:32:12.4382800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4382977Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4383111Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4383115Z 2025-05-07T20:32:12.4383322Z self = 2025-05-07T20:32:12.4384116Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4384623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe96b90>} 2025-05-07T20:32:12.4385373Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4385568Z context = 2025-05-07T20:32:12.4385573Z 2025-05-07T20:32:12.4385738Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4386005Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4386115Z module_map=module_map) 2025-05-07T20:32:12.4386294Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4386395Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4386480Z E ^ 2025-05-07T20:32:12.4386839Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4386843Z 2025-05-07T20:32:12.4387256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4387310Z 2025-05-07T20:32:12.4387417Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4387645Z self=, 2025-05-07T20:32:12.4387726Z T=2048, 2025-05-07T20:32:12.4387813Z D=7168, 2025-05-07T20:32:12.4387899Z scale_ub=1200.0, 2025-05-07T20:32:12.4387988Z contiguous=False, 2025-05-07T20:32:12.4388079Z compiled=False, 2025-05-07T20:32:12.4388149Z ) 2025-05-07T20:32:12.4388409Z self = 2025-05-07T20:32:12.4388667Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.4388672Z 2025-05-07T20:32:12.4388751Z @given( 2025-05-07T20:32:12.4388871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4388979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4389101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4389224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4389343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4389416Z ) 2025-05-07T20:32:12.4389666Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4389763Z def test_silu_mul_quant( 2025-05-07T20:32:12.4390135Z self, 2025-05-07T20:32:12.4390262Z T: int, 2025-05-07T20:32:12.4390372Z D: int, 2025-05-07T20:32:12.4390508Z scale_ub: Optional[float], 2025-05-07T20:32:12.4390613Z contiguous: bool, 2025-05-07T20:32:12.4390704Z compiled: bool, 2025-05-07T20:32:12.4390789Z ) -> None: 2025-05-07T20:32:12.4390891Z torch.manual_seed(2025) 2025-05-07T20:32:12.4390964Z 2025-05-07T20:32:12.4391143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4391219Z 2025-05-07T20:32:12.4391316Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4391447Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4391540Z x = x_sign * x_clamp 2025-05-07T20:32:12.4391623Z x0 = x[:, :D] 2025-05-07T20:32:12.4391710Z x1 = x[:, D:] 2025-05-07T20:32:12.4391784Z 2025-05-07T20:32:12.4391869Z if contiguous: 2025-05-07T20:32:12.4391968Z x0 = x0.contiguous() 2025-05-07T20:32:12.4392059Z x1 = x1.contiguous() 2025-05-07T20:32:12.4392133Z 2025-05-07T20:32:12.4392233Z if scale_ub is not None: 2025-05-07T20:32:12.4392342Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4392482Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4392563Z ) 2025-05-07T20:32:12.4392642Z else: 2025-05-07T20:32:12.4392746Z scale_ub_tensor = None 2025-05-07T20:32:12.4392818Z 2025-05-07T20:32:12.4392949Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4393050Z op = silu_mul_quant 2025-05-07T20:32:12.4393136Z if compiled: 2025-05-07T20:32:12.4393237Z op = torch.compile(op) 2025-05-07T20:32:12.4393353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4393426Z 2025-05-07T20:32:12.4393519Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4393529Z 2025-05-07T20:32:12.4393628Z moe/activation_test.py:117: 2025-05-07T20:32:12.4393757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4393867Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4393972Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4394477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:12.4394583Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4394948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4395344Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4395698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4395795Z kernel = self.compile( 2025-05-07T20:32:12.4396183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4396360Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4396560Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4396564Z 2025-05-07T20:32:12.4396899Z self = 2025-05-07T20:32:12.4397684Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4398203Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5dc5e0>} 2025-05-07T20:32:12.4398942Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4399141Z context = 2025-05-07T20:32:12.4399148Z 2025-05-07T20:32:12.4399318Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4399591Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4399707Z module_map=module_map) 2025-05-07T20:32:12.4399870Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4399973Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4400056Z E ^ 2025-05-07T20:32:12.4400414Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4400418Z 2025-05-07T20:32:12.4400837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4400842Z 2025-05-07T20:32:12.4400950Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4401172Z self=, 2025-05-07T20:32:12.4401262Z T=1, 2025-05-07T20:32:12.4401340Z D=7168, 2025-05-07T20:32:12.4401428Z scale_ub=None, 2025-05-07T20:32:12.4401522Z contiguous=True, 2025-05-07T20:32:12.4401607Z compiled=False, 2025-05-07T20:32:12.4401702Z ) 2025-05-07T20:32:12.4401919Z self = 2025-05-07T20:32:12.4402087Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:12.4402091Z 2025-05-07T20:32:12.4402179Z @given( 2025-05-07T20:32:12.4407975Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4408106Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4408245Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4408369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4408487Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4408577Z ) 2025-05-07T20:32:12.4408835Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4408938Z def test_silu_mul_quant( 2025-05-07T20:32:12.4409025Z self, 2025-05-07T20:32:12.4409105Z T: int, 2025-05-07T20:32:12.4409189Z D: int, 2025-05-07T20:32:12.4409291Z scale_ub: Optional[float], 2025-05-07T20:32:12.4409388Z contiguous: bool, 2025-05-07T20:32:12.4409564Z compiled: bool, 2025-05-07T20:32:12.4409646Z ) -> None: 2025-05-07T20:32:12.4409744Z torch.manual_seed(2025) 2025-05-07T20:32:12.4409831Z 2025-05-07T20:32:12.4410007Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4410082Z 2025-05-07T20:32:12.4410186Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4410312Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4410404Z x = x_sign * x_clamp 2025-05-07T20:32:12.4410491Z x0 = x[:, :D] 2025-05-07T20:32:12.4410624Z x1 = x[:, D:] 2025-05-07T20:32:12.4410698Z 2025-05-07T20:32:12.4410793Z if contiguous: 2025-05-07T20:32:12.4410967Z x0 = x0.contiguous() 2025-05-07T20:32:12.4411071Z x1 = x1.contiguous() 2025-05-07T20:32:12.4411147Z 2025-05-07T20:32:12.4411240Z if scale_ub is not None: 2025-05-07T20:32:12.4411354Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4411494Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4411579Z ) 2025-05-07T20:32:12.4411665Z else: 2025-05-07T20:32:12.4411763Z scale_ub_tensor = None 2025-05-07T20:32:12.4411837Z 2025-05-07T20:32:12.4411983Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4412077Z op = silu_mul_quant 2025-05-07T20:32:12.4412168Z if compiled: 2025-05-07T20:32:12.4412281Z op = torch.compile(op) 2025-05-07T20:32:12.4412392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4412480Z 2025-05-07T20:32:12.4412576Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4412581Z 2025-05-07T20:32:12.4412691Z moe/activation_test.py:117: 2025-05-07T20:32:12.4412833Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4412940Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4413043Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4413558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4413663Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4414032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4414262Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4414605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4414714Z kernel = self.compile( 2025-05-07T20:32:12.4415109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4415288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4415426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4415434Z 2025-05-07T20:32:12.4415642Z self = 2025-05-07T20:32:12.4416425Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4416927Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5dcd30>} 2025-05-07T20:32:12.4417688Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4417884Z context = 2025-05-07T20:32:12.4417888Z 2025-05-07T20:32:12.4418057Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4418419Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4418533Z module_map=module_map) 2025-05-07T20:32:12.4418709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4418813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4418891Z E ^ 2025-05-07T20:32:12.4419256Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4419302Z 2025-05-07T20:32:12.4419919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4419925Z 2025-05-07T20:32:12.4420034Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4420269Z self=, 2025-05-07T20:32:12.4420350Z T=16384, 2025-05-07T20:32:12.4420441Z D=7168, 2025-05-07T20:32:12.4420529Z scale_ub=1200.0, 2025-05-07T20:32:12.4420619Z contiguous=False, 2025-05-07T20:32:12.4420713Z compiled=True, 2025-05-07T20:32:12.4420793Z ) 2025-05-07T20:32:12.4421012Z self = 2025-05-07T20:32:12.4421201Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4421206Z 2025-05-07T20:32:12.4421284Z @given( 2025-05-07T20:32:12.4421406Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4421516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4421635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4421769Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4421890Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4421964Z ) 2025-05-07T20:32:12.4422220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4422322Z def test_silu_mul_quant( 2025-05-07T20:32:12.4422399Z self, 2025-05-07T20:32:12.4422487Z T: int, 2025-05-07T20:32:12.4422566Z D: int, 2025-05-07T20:32:12.4422668Z scale_ub: Optional[float], 2025-05-07T20:32:12.4422767Z contiguous: bool, 2025-05-07T20:32:12.4422855Z compiled: bool, 2025-05-07T20:32:12.4422934Z ) -> None: 2025-05-07T20:32:12.4423041Z torch.manual_seed(2025) 2025-05-07T20:32:12.4423114Z 2025-05-07T20:32:12.4423294Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4423373Z 2025-05-07T20:32:12.4423469Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4423610Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4423702Z x = x_sign * x_clamp 2025-05-07T20:32:12.4423784Z x0 = x[:, :D] 2025-05-07T20:32:12.4423876Z x1 = x[:, D:] 2025-05-07T20:32:12.4423947Z 2025-05-07T20:32:12.4424037Z if contiguous: 2025-05-07T20:32:12.4424137Z x0 = x0.contiguous() 2025-05-07T20:32:12.4424233Z x1 = x1.contiguous() 2025-05-07T20:32:12.4424306Z 2025-05-07T20:32:12.4424406Z if scale_ub is not None: 2025-05-07T20:32:12.4424513Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4424658Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4424737Z ) 2025-05-07T20:32:12.4424816Z else: 2025-05-07T20:32:12.4424921Z scale_ub_tensor = None 2025-05-07T20:32:12.4425000Z 2025-05-07T20:32:12.4425133Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4425231Z op = silu_mul_quant 2025-05-07T20:32:12.4425327Z if compiled: 2025-05-07T20:32:12.4425430Z op = torch.compile(op) 2025-05-07T20:32:12.4425549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4425624Z 2025-05-07T20:32:12.4425720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4425775Z 2025-05-07T20:32:12.4425883Z moe/activation_test.py:117: 2025-05-07T20:32:12.4426013Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4426126Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4426230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4426604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4426710Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4427211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4427425Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4427799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4428024Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4428381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4428481Z kernel = self.compile( 2025-05-07T20:32:12.4428862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4429050Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4429176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4429181Z 2025-05-07T20:32:12.4429393Z self = 2025-05-07T20:32:12.4430176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4430676Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5ddbd0>} 2025-05-07T20:32:12.4431428Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4431620Z context = 2025-05-07T20:32:12.4431624Z 2025-05-07T20:32:12.4431798Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4432064Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4432177Z module_map=module_map) 2025-05-07T20:32:12.4432347Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4432450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4432526Z E ^ 2025-05-07T20:32:12.4432891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4432896Z 2025-05-07T20:32:12.4433308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4433312Z 2025-05-07T20:32:12.4433423Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4433643Z self=, 2025-05-07T20:32:12.4433725Z T=1, 2025-05-07T20:32:12.4433811Z D=7168, 2025-05-07T20:32:12.4433899Z scale_ub=None, 2025-05-07T20:32:12.4433989Z contiguous=False, 2025-05-07T20:32:12.4434085Z compiled=False, 2025-05-07T20:32:12.4434167Z ) 2025-05-07T20:32:12.4434388Z self = 2025-05-07T20:32:12.4434565Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.4434569Z 2025-05-07T20:32:12.4434780Z @given( 2025-05-07T20:32:12.4434911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4435013Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4435130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4435256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4435372Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4435449Z ) 2025-05-07T20:32:12.4435703Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4435841Z def test_silu_mul_quant( 2025-05-07T20:32:12.4435920Z self, 2025-05-07T20:32:12.4436007Z T: int, 2025-05-07T20:32:12.4436154Z D: int, 2025-05-07T20:32:12.4436267Z scale_ub: Optional[float], 2025-05-07T20:32:12.4436360Z contiguous: bool, 2025-05-07T20:32:12.4436446Z compiled: bool, 2025-05-07T20:32:12.4436534Z ) -> None: 2025-05-07T20:32:12.4436633Z torch.manual_seed(2025) 2025-05-07T20:32:12.4436712Z 2025-05-07T20:32:12.4436890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4436965Z 2025-05-07T20:32:12.4437059Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4437194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4437285Z x = x_sign * x_clamp 2025-05-07T20:32:12.4437366Z x0 = x[:, :D] 2025-05-07T20:32:12.4437453Z x1 = x[:, D:] 2025-05-07T20:32:12.4437526Z 2025-05-07T20:32:12.4437615Z if contiguous: 2025-05-07T20:32:12.4437714Z x0 = x0.contiguous() 2025-05-07T20:32:12.4437804Z x1 = x1.contiguous() 2025-05-07T20:32:12.4437883Z 2025-05-07T20:32:12.4437980Z if scale_ub is not None: 2025-05-07T20:32:12.4438085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4438226Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4438307Z ) 2025-05-07T20:32:12.4438390Z else: 2025-05-07T20:32:12.4438494Z scale_ub_tensor = None 2025-05-07T20:32:12.4438564Z 2025-05-07T20:32:12.4438695Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4438794Z op = silu_mul_quant 2025-05-07T20:32:12.4438882Z if compiled: 2025-05-07T20:32:12.4438989Z op = torch.compile(op) 2025-05-07T20:32:12.4439097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4439170Z 2025-05-07T20:32:12.4439267Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4439271Z 2025-05-07T20:32:12.4439375Z moe/activation_test.py:117: 2025-05-07T20:32:12.4439504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4439620Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4439721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4440224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4440333Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4440698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4440928Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4441269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4441366Z kernel = self.compile( 2025-05-07T20:32:12.4441758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4441940Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4442075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4442080Z 2025-05-07T20:32:12.4442291Z self = 2025-05-07T20:32:12.4443112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4443617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5de050>} 2025-05-07T20:32:12.4444370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4444714Z context = 2025-05-07T20:32:12.4444719Z 2025-05-07T20:32:12.4444891Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4445153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4445272Z module_map=module_map) 2025-05-07T20:32:12.4445440Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4445546Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4445625Z E ^ 2025-05-07T20:32:12.4445983Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4445987Z 2025-05-07T20:32:12.4446412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4446419Z 2025-05-07T20:32:12.4446525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4446759Z self=, 2025-05-07T20:32:12.4446840Z T=2048, 2025-05-07T20:32:12.4446917Z D=7168, 2025-05-07T20:32:12.4447008Z scale_ub=None, 2025-05-07T20:32:12.4447096Z contiguous=False, 2025-05-07T20:32:12.4447181Z compiled=True, 2025-05-07T20:32:12.4447260Z ) 2025-05-07T20:32:12.4447480Z self = 2025-05-07T20:32:12.4447656Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4447660Z 2025-05-07T20:32:12.4447743Z @given( 2025-05-07T20:32:12.4447864Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4447966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4448094Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4448216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4448345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4448419Z ) 2025-05-07T20:32:12.4448669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4448771Z def test_silu_mul_quant( 2025-05-07T20:32:12.4448847Z self, 2025-05-07T20:32:12.4448927Z T: int, 2025-05-07T20:32:12.4449010Z D: int, 2025-05-07T20:32:12.4449111Z scale_ub: Optional[float], 2025-05-07T20:32:12.4449203Z contiguous: bool, 2025-05-07T20:32:12.4449299Z compiled: bool, 2025-05-07T20:32:12.4449378Z ) -> None: 2025-05-07T20:32:12.4449474Z torch.manual_seed(2025) 2025-05-07T20:32:12.4449555Z 2025-05-07T20:32:12.4449722Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4449801Z 2025-05-07T20:32:12.4449894Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4450026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4450125Z x = x_sign * x_clamp 2025-05-07T20:32:12.4450209Z x0 = x[:, :D] 2025-05-07T20:32:12.4450296Z x1 = x[:, D:] 2025-05-07T20:32:12.4450374Z 2025-05-07T20:32:12.4450461Z if contiguous: 2025-05-07T20:32:12.4450556Z x0 = x0.contiguous() 2025-05-07T20:32:12.4450653Z x1 = x1.contiguous() 2025-05-07T20:32:12.4450777Z 2025-05-07T20:32:12.4450872Z if scale_ub is not None: 2025-05-07T20:32:12.4450986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4451120Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4451201Z ) 2025-05-07T20:32:12.4451276Z else: 2025-05-07T20:32:12.4451375Z scale_ub_tensor = None 2025-05-07T20:32:12.4451455Z 2025-05-07T20:32:12.4451588Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4451681Z op = silu_mul_quant 2025-05-07T20:32:12.4451820Z if compiled: 2025-05-07T20:32:12.4451921Z op = torch.compile(op) 2025-05-07T20:32:12.4452106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4452191Z 2025-05-07T20:32:12.4452284Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4452289Z 2025-05-07T20:32:12.4452389Z moe/activation_test.py:117: 2025-05-07T20:32:12.4452526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4452632Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4452739Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4453105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4453201Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4453711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4453814Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4454178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4454408Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4454750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4454856Z kernel = self.compile( 2025-05-07T20:32:12.4455239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4455415Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4455549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4455554Z 2025-05-07T20:32:12.4455761Z self = 2025-05-07T20:32:12.4456545Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4457052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5df1c0>} 2025-05-07T20:32:12.4457821Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4458012Z context = 2025-05-07T20:32:12.4458017Z 2025-05-07T20:32:12.4458185Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4458455Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4458567Z module_map=module_map) 2025-05-07T20:32:12.4458737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4458845Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4458924Z E ^ 2025-05-07T20:32:12.4459290Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4459338Z 2025-05-07T20:32:12.4459750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4459755Z 2025-05-07T20:32:12.4459962Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4460192Z self=, 2025-05-07T20:32:12.4460275Z T=4096, 2025-05-07T20:32:12.4460357Z D=7168, 2025-05-07T20:32:12.4460439Z scale_ub=None, 2025-05-07T20:32:12.4460527Z contiguous=False, 2025-05-07T20:32:12.4460618Z compiled=True, 2025-05-07T20:32:12.4460737Z ) 2025-05-07T20:32:12.4460959Z self = 2025-05-07T20:32:12.4461216Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4461221Z 2025-05-07T20:32:12.4461301Z @given( 2025-05-07T20:32:12.4461421Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4461531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4461649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4461774Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4461894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4461969Z ) 2025-05-07T20:32:12.4462221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4462316Z def test_silu_mul_quant( 2025-05-07T20:32:12.4462393Z self, 2025-05-07T20:32:12.4462480Z T: int, 2025-05-07T20:32:12.4462558Z D: int, 2025-05-07T20:32:12.4462662Z scale_ub: Optional[float], 2025-05-07T20:32:12.4462760Z contiguous: bool, 2025-05-07T20:32:12.4462853Z compiled: bool, 2025-05-07T20:32:12.4462931Z ) -> None: 2025-05-07T20:32:12.4463037Z torch.manual_seed(2025) 2025-05-07T20:32:12.4463111Z 2025-05-07T20:32:12.4463281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4463364Z 2025-05-07T20:32:12.4463461Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4463597Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4463690Z x = x_sign * x_clamp 2025-05-07T20:32:12.4463773Z x0 = x[:, :D] 2025-05-07T20:32:12.4463863Z x1 = x[:, D:] 2025-05-07T20:32:12.4463937Z 2025-05-07T20:32:12.4464023Z if contiguous: 2025-05-07T20:32:12.4464127Z x0 = x0.contiguous() 2025-05-07T20:32:12.4464220Z x1 = x1.contiguous() 2025-05-07T20:32:12.4464294Z 2025-05-07T20:32:12.4464398Z if scale_ub is not None: 2025-05-07T20:32:12.4464504Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4464648Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4464732Z ) 2025-05-07T20:32:12.4464813Z else: 2025-05-07T20:32:12.4464919Z scale_ub_tensor = None 2025-05-07T20:32:12.4464995Z 2025-05-07T20:32:12.4465130Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4465231Z op = silu_mul_quant 2025-05-07T20:32:12.4465318Z if compiled: 2025-05-07T20:32:12.4465423Z op = torch.compile(op) 2025-05-07T20:32:12.4465539Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4465610Z 2025-05-07T20:32:12.4465704Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4465709Z 2025-05-07T20:32:12.4465816Z moe/activation_test.py:117: 2025-05-07T20:32:12.4465944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4466074Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4466177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4466548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4466652Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function>, 'min_dot_size': <function ... at 0x7f06aaf301f0>}
module_map = {'triton.language.extra.libdevice': <module>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
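Every one of these failures is the same compile-time rejection: fp8e4nv is Triton's name for the float8_e4m3fn format, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada, Hopper) onward, while the A10G in a linux.g5.4xlarge.nvidia.gpu runner reports capability 8.6. Below is a minimal sketch of a guard a test could use to skip the fp8 path on such GPUs; the helper name supports_fp8e4nv is ours, not part of FBGEMM or this test file:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Best-effort check: Triton lowers fp8e4nv (torch.float8_e4m3fn) only on
    # devices with compute capability >= (8, 9), i.e. Ada and Hopper. The
    # A10G on g5 instances reports (8, 6), so the cast is rejected there.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the failing test:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
# def test_silu_mul_quant(self, ...) -> None: ...

A guard like this would turn the repeated CompilationError below into a single skip on pre-sm_89 runners, without changing behavior on hardware that does support fp8.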
Hypothesis then tried ten further examples. Each one failed at kernel compilation with the identical error, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
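The rejection is independent of the FBGEMM kernel itself: Triton raises it for any cast to fp8e4nv during lowering, whether or not torch.compile is in the stack. A standalone sketch (our own construction, with a hypothetical kernel name, not taken from this repository) that should reproduce the identical ValueError on a pre-sm_89 GPU and pass on Ada or Hopper:

import torch
import triton
import triton.language as tl


@triton.jit
def cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # This cast is the operation _fbgemm_silu_mul_quant ultimately needs;
    # on GPUs older than sm_89, Triton rejects it at compile time with
    # "type fp8e4nv not supported in this architecture".
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)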
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4628370Z 2025-05-07T20:32:12.4628785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4628792Z 2025-05-07T20:32:12.4628907Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4629131Z self=, 2025-05-07T20:32:12.4629221Z T=128, 2025-05-07T20:32:12.4629304Z D=7168, 2025-05-07T20:32:12.4629391Z scale_ub=1200.0, 2025-05-07T20:32:12.4629488Z contiguous=False, 2025-05-07T20:32:12.4629575Z compiled=True, 2025-05-07T20:32:12.4629652Z ) 2025-05-07T20:32:12.4629876Z self = 2025-05-07T20:32:12.4630055Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4630060Z 2025-05-07T20:32:12.4630142Z @given( 2025-05-07T20:32:12.4630276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4630379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4630501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4630628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4630750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4630833Z ) 2025-05-07T20:32:12.4631081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4631178Z def test_silu_mul_quant( 2025-05-07T20:32:12.4631263Z self, 2025-05-07T20:32:12.4631343Z T: int, 2025-05-07T20:32:12.4631423Z D: int, 2025-05-07T20:32:12.4631533Z scale_ub: Optional[float], 2025-05-07T20:32:12.4631627Z contiguous: bool, 2025-05-07T20:32:12.4631719Z compiled: bool, 2025-05-07T20:32:12.4631808Z ) -> None: 2025-05-07T20:32:12.4631907Z torch.manual_seed(2025) 2025-05-07T20:32:12.4631988Z 2025-05-07T20:32:12.4632164Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4632241Z 2025-05-07T20:32:12.4632342Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4632469Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4632617Z x = x_sign * x_clamp 2025-05-07T20:32:12.4632707Z x0 = x[:, :D] 2025-05-07T20:32:12.4632791Z x1 = x[:, D:] 2025-05-07T20:32:12.4632865Z 2025-05-07T20:32:12.4632957Z if contiguous: 2025-05-07T20:32:12.4633054Z x0 = x0.contiguous() 2025-05-07T20:32:12.4633147Z x1 = x1.contiguous() 2025-05-07T20:32:12.4633230Z 2025-05-07T20:32:12.4633325Z if scale_ub is not None: 2025-05-07T20:32:12.4633434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4633582Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4633708Z ) 2025-05-07T20:32:12.4633794Z else: 2025-05-07T20:32:12.4633966Z scale_ub_tensor = None 2025-05-07T20:32:12.4634043Z 2025-05-07T20:32:12.4634185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4634279Z op = silu_mul_quant 2025-05-07T20:32:12.4634368Z if compiled: 2025-05-07T20:32:12.4634481Z op = torch.compile(op) 2025-05-07T20:32:12.4634592Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4634668Z 2025-05-07T20:32:12.4634770Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4634774Z 2025-05-07T20:32:12.4634876Z moe/activation_test.py:117: 2025-05-07T20:32:12.4635012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4635116Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4635218Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4635601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4635705Z return fn(*args, **kwargs) 
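All of the CompilationError failures in this run have a single root cause: the job runs on a g5.4xlarge whose NVIDIA A10G GPU is compute capability sm_86, and this Triton build only accepts the fp8e4nv (float8_e4m3fn) element type on newer architectures, which is why the ValueError lists only 'fp8e4b15' and 'fp8e5' as supported. A minimal guard sketch, assuming sm_89 (Ada) is the first architecture this Triton build accepts fp8e4nv on; the helper name is hypothetical and not part of the test file:

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv conversions need compute capability >= 8.9;
    # the A10G on this runner reports (8, 6), matching the ValueError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, e.g.:
# @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")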
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the first example above]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the first example above]
2025-05-07T20:32:12.4655956Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.4656186Z     self=,
2025-05-07T20:32:12.4656269Z     T=16384,
2025-05-07T20:32:12.4656348Z     D=5120,
2025-05-07T20:32:12.4656441Z     scale_ub=None,
2025-05-07T20:32:12.4656532Z     contiguous=False,
2025-05-07T20:32:12.4656621Z     compiled=False,
2025-05-07T20:32:12.4656708Z )
2025-05-07T20:32:12.4656926Z self = 
2025-05-07T20:32:12.4657117Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:12.4657126Z 
      [same test source as in the first example above; the failing tail:]
2025-05-07T20:32:12.4659389Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:12.4659472Z 
2025-05-07T20:32:12.4659569Z         x_sign = torch.sign(x)
2025-05-07T20:32:12.4659702Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:12.4661644Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:12.4661654Z 
2025-05-07T20:32:12.4661791Z moe/activation_test.py:95: OutOfMemoryError
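The OutOfMemoryError sizes line up exactly with the tensors the failing lines try to materialize: for T=16384 and D=5120, x is a [16384, 10240] bfloat16 tensor, and each elementwise intermediate (torch.abs(x), x_clamp, and so on) is another tensor of the same size. A quick check of the arithmetic:

# One [T, 2*D] bfloat16 tensor at 2 bytes per element.
T, D = 16384, 5120
size_mib = T * (2 * D) * 2 / 2**20
print(size_mib)  # 320.0 -> matches "Tried to allocate 320.00 MiB" above

The request itself is small next to the card's 22.07 GiB; these examples fail because roughly 21.5 to 21.7 GiB is still allocated by PyTorch, presumably accumulated across the earlier Hypothesis examples in this same process, so by this point even 40 MiB allocations cannot be satisfied.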
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (28.44 MiB free; 21.61 GiB allocated by PyTorch) at x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)  [moe/activation_test.py:95]

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free; 21.50 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch) at x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)  [moe/activation_test.py:95]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch) at x_sign = torch.sign(x)  [moe/activation_test.py:94]
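The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for the fragmentation case; independent of that, releasing cached blocks between Hypothesis examples would keep one example's intermediates from starving the next. A minimal sketch, not taken from the test file; where exactly the hook runs (for instance a unittest tearDown between examples) is an assumption:

import gc
import os

# Must be set before the first CUDA allocation to take effect
# (the setting is quoted from the OutOfMemoryError hint above).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cuda_memory() -> None:
    gc.collect()              # drop dead Python references to tensors first
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver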
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the first example above, minus the torch._dynamo frame, since compiled=False calls silu_mul_quant directly]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the compiled=False example above]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the compiled=False example above]
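Note that the compiled=False examples fail identically, so torch.compile is not the trigger: silu_mul_quant launches the _fbgemm_silu_mul_quant Triton kernel directly, and the error is raised while Triton lowers it to TTIR (make_ir -> ast_to_ttir). A minimal eager-mode reproduction sketch, assuming the fbgemm_gpu gen_ai wheel is importable exactly as in the traceback:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
# On this sm_86 GPU the call raises the CompilationError above during
# Triton compilation, before any kernel actually runs.
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)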
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free; 21.69 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the compiled=False examples above]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch) at x_sign = torch.sign(x)  [moe/activation_test.py:94]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
      [fails at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)]
2025-05-07T20:32:12.4783829Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.4783836Z 2025-05-07T20:32:12.4783951Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.4783959Z 2025-05-07T20:32:12.4784065Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4784287Z self=, 2025-05-07T20:32:12.4784368Z T=4096, 2025-05-07T20:32:12.4784441Z D=7168, 2025-05-07T20:32:12.4784521Z scale_ub=1200.0, 2025-05-07T20:32:12.4784614Z contiguous=True, 2025-05-07T20:32:12.4784697Z compiled=False, 2025-05-07T20:32:12.4784819Z ) 2025-05-07T20:32:12.4785042Z self = 2025-05-07T20:32:12.4785213Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.4785218Z 2025-05-07T20:32:12.4785294Z @given( 2025-05-07T20:32:12.4785414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4785511Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4785630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4785745Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4785902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4785984Z ) 2025-05-07T20:32:12.4786265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4786358Z def test_silu_mul_quant( 2025-05-07T20:32:12.4786437Z self, 2025-05-07T20:32:12.4786511Z T: int, 2025-05-07T20:32:12.4786589Z D: int, 2025-05-07T20:32:12.4786700Z scale_ub: Optional[float], 2025-05-07T20:32:12.4786787Z contiguous: bool, 2025-05-07T20:32:12.4786872Z compiled: bool, 2025-05-07T20:32:12.4786956Z ) -> None: 2025-05-07T20:32:12.4787049Z torch.manual_seed(2025) 2025-05-07T20:32:12.4787128Z 2025-05-07T20:32:12.4787293Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4789064Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.4789080Z 2025-05-07T20:32:12.4789241Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.4789246Z 2025-05-07T20:32:12.4789352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4789575Z self=, 2025-05-07T20:32:12.4789652Z T=16384, 2025-05-07T20:32:12.4789730Z D=7168, 2025-05-07T20:32:12.4790120Z scale_ub=None, 2025-05-07T20:32:12.4790252Z contiguous=False, 2025-05-07T20:32:12.4790336Z compiled=True, 2025-05-07T20:32:12.4790416Z ) 2025-05-07T20:32:12.4790635Z self = 2025-05-07T20:32:12.4790822Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4790826Z 2025-05-07T20:32:12.4790902Z @given( 2025-05-07T20:32:12.4791017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4791124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4791242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4791360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4791477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4791549Z ) 2025-05-07T20:32:12.4791790Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4791889Z def test_silu_mul_quant( 2025-05-07T20:32:12.4791967Z self, 2025-05-07T20:32:12.4792047Z T: int, 2025-05-07T20:32:12.4792123Z D: int, 2025-05-07T20:32:12.4792224Z scale_ub: Optional[float], 2025-05-07T20:32:12.4792320Z contiguous: bool, 2025-05-07T20:32:12.4792404Z compiled: bool, 2025-05-07T20:32:12.4792484Z ) -> None: 2025-05-07T20:32:12.4792586Z torch.manual_seed(2025) 2025-05-07T20:32:12.4792657Z 2025-05-07T20:32:12.4792822Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4794681Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.4794827Z 2025-05-07T20:32:12.4794948Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.4794958Z 2025-05-07T20:32:12.4795139Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4795360Z self=, 2025-05-07T20:32:12.4795444Z T=4096, 2025-05-07T20:32:12.4795521Z D=7168, 2025-05-07T20:32:12.4795604Z scale_ub=None, 2025-05-07T20:32:12.4795698Z contiguous=True, 2025-05-07T20:32:12.4795788Z compiled=False, 2025-05-07T20:32:12.4795863Z ) 2025-05-07T20:32:12.4796084Z self = 2025-05-07T20:32:12.4796254Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:12.4796259Z 2025-05-07T20:32:12.4801857Z @given( 2025-05-07T20:32:12.4802019Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4802126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4802251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4802392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4802514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4802602Z ) 2025-05-07T20:32:12.4802855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4802959Z def test_silu_mul_quant( 2025-05-07T20:32:12.4803049Z self, 2025-05-07T20:32:12.4803136Z T: int, 2025-05-07T20:32:12.4803322Z D: int, 2025-05-07T20:32:12.4803437Z scale_ub: Optional[float], 2025-05-07T20:32:12.4803532Z contiguous: bool, 2025-05-07T20:32:12.4803622Z compiled: bool, 2025-05-07T20:32:12.4803715Z ) -> None: 2025-05-07T20:32:12.4803815Z torch.manual_seed(2025) 2025-05-07T20:32:12.4803892Z 2025-05-07T20:32:12.4804072Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4805863Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.4805881Z 2025-05-07T20:32:12.4806006Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.4806011Z 2025-05-07T20:32:12.4806118Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4806351Z self=, 2025-05-07T20:32:12.4806432Z T=16384, 2025-05-07T20:32:12.4806512Z D=7168, 2025-05-07T20:32:12.4806605Z scale_ub=None, 2025-05-07T20:32:12.4806695Z contiguous=True, 2025-05-07T20:32:12.4806786Z compiled=False, 2025-05-07T20:32:12.4806870Z ) 2025-05-07T20:32:12.4807091Z self = 2025-05-07T20:32:12.4807272Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:12.4807283Z 2025-05-07T20:32:12.4807362Z @given( 2025-05-07T20:32:12.4807485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4807652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4807771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4807892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4808016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4808094Z ) 2025-05-07T20:32:12.4808343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4808449Z def test_silu_mul_quant( 2025-05-07T20:32:12.4808529Z self, 2025-05-07T20:32:12.4808610Z T: int, 2025-05-07T20:32:12.4808739Z D: int, 2025-05-07T20:32:12.4808844Z scale_ub: Optional[float], 2025-05-07T20:32:12.4808983Z contiguous: bool, 2025-05-07T20:32:12.4809074Z compiled: bool, 2025-05-07T20:32:12.4809154Z ) -> None: 2025-05-07T20:32:12.4809262Z torch.manual_seed(2025) 2025-05-07T20:32:12.4809339Z 2025-05-07T20:32:12.4809507Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4811290Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.4811298Z 2025-05-07T20:32:12.4811417Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.4811425Z 2025-05-07T20:32:12.4811537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4811758Z self=, 2025-05-07T20:32:12.4811844Z T=16384, 2025-05-07T20:32:12.4811920Z D=7168, 2025-05-07T20:32:12.4812052Z scale_ub=1200.0, 2025-05-07T20:32:12.4812146Z contiguous=True, 2025-05-07T20:32:12.4812232Z compiled=False, 2025-05-07T20:32:12.4812311Z ) 2025-05-07T20:32:12.4812530Z self = 2025-05-07T20:32:12.4812711Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.4812715Z 2025-05-07T20:32:12.4812795Z @given( 2025-05-07T20:32:12.4812919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4813022Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4813142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4813274Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4813390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4813469Z ) 2025-05-07T20:32:12.4813714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4813811Z def test_silu_mul_quant( 2025-05-07T20:32:12.4813896Z self, 2025-05-07T20:32:12.4813974Z T: int, 2025-05-07T20:32:12.4814051Z D: int, 2025-05-07T20:32:12.4814161Z scale_ub: Optional[float], 2025-05-07T20:32:12.4814257Z contiguous: bool, 2025-05-07T20:32:12.4814343Z compiled: bool, 2025-05-07T20:32:12.4814432Z ) -> None: 2025-05-07T20:32:12.4814527Z torch.manual_seed(2025) 2025-05-07T20:32:12.4814599Z 2025-05-07T20:32:12.4814772Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4816584Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
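The allocator hint repeated in these errors is actionable: PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, so it must be set before the first CUDA allocation. A minimal sketch of the in-process equivalent, assuming the test job can set the variable before torch starts; note that with only 26.44 MiB free the device here is essentially full, so expandable segments would address the fragmentation component only, not genuine exhaustion:

    import os

    # Must be set before torch initializes its CUDA caching allocator,
    # i.e. before the first CUDA tensor is created (safest: before importing torch).
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # noqa: E402

    # With expandable segments, reserved-but-unallocated memory (19.12 MiB above)
    # can be grown into rather than stranded; a fresh 448 MiB request on a truly
    # full 22 GiB device will still fail.
    x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)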
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aa858940>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
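This CompilationError is an architecture limit rather than a test bug: Triton's fp8e4nv corresponds to float8_e4m3fn, which to my knowledge requires compute capability 8.9 or newer (Ada/Hopper), while the g5 runner's A10G reports SM 8.6, matching the "supported fp8 dtypes are ('fp8e4b15', 'fp8e5')" message. A hedged sketch of a capability guard, assuming unittest-style tests like ActivationTests (the class name Fp8ActivationTests below is hypothetical):

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) Triton kernels need compute capability >= (8, 9).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class Fp8ActivationTests(unittest.TestCase):
        ...

With such a guard the fp8 cases would be reported as skipped on this runner instead of failing the whole job.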
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(the Triton jit/compile frames are identical to the traceback above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
See " 2025-05-07T20:32:12.4867405Z 2025-05-07T20:32:12.4867664Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:12.4867869Z ================= 1 failed, 1 deselected, 3 warnings in 21.69s ================= 2025-05-07T20:32:14.1714508Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:14.2345037Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:14.2345361Z 2025-05-07T20:32:16.2361049Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:18.3766558Z ============================= test session starts ============================== 2025-05-07T20:32:18.3767180Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:18.3767724Z cachedir: .pytest_cache 2025-05-07T20:32:18.3768306Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:18.3769024Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:18.3769433Z plugins: hypothesis-6.131.14 2025-05-07T20:32:19.9752452Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:20.1531998Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:20.1532396Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:20.1532628Z 2025-05-07T20:32:22.6557800Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6558634Z self=, 2025-05-07T20:32:22.6559043Z T=1, 2025-05-07T20:32:22.6559243Z D=5120, 2025-05-07T20:32:22.6559462Z scale_ub=None, 2025-05-07T20:32:22.6559678Z contiguous=True, 2025-05-07T20:32:22.6559908Z compiled=True, 2025-05-07T20:32:22.6560110Z ) 2025-05-07T20:32:22.6560439Z self = 2025-05-07T20:32:22.6560930Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.6561191Z 2025-05-07T20:32:22.6561273Z @given( 2025-05-07T20:32:22.6561516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.6561829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.6562136Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.6562460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.6562792Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.6563083Z ) 2025-05-07T20:32:22.6563435Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.6563882Z def test_silu_mul_quant( 2025-05-07T20:32:22.6564138Z self, 2025-05-07T20:32:22.6564337Z T: int, 2025-05-07T20:32:22.6564539Z D: int, 2025-05-07T20:32:22.6564770Z scale_ub: Optional[float], 2025-05-07T20:32:22.6565045Z contiguous: bool, 2025-05-07T20:32:22.6565368Z compiled: bool, 2025-05-07T20:32:22.6565672Z ) -> None: 2025-05-07T20:32:22.6565895Z torch.manual_seed(2025) 2025-05-07T20:32:22.6566508Z 2025-05-07T20:32:22.6566789Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.6567136Z 2025-05-07T20:32:22.6567328Z x_sign = torch.sign(x) 2025-05-07T20:32:22.6567622Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:22.6567934Z x = x_sign * x_clamp 2025-05-07T20:32:22.6568171Z x0 = x[:, :D] 2025-05-07T20:32:22.6568390Z x1 = x[:, D:] 2025-05-07T20:32:22.6568602Z 2025-05-07T20:32:22.6568788Z if contiguous: 2025-05-07T20:32:22.6569023Z x0 = x0.contiguous() 2025-05-07T20:32:22.6569385Z x1 = x1.contiguous() 2025-05-07T20:32:22.6569622Z 2025-05-07T20:32:22.6569910Z if scale_ub is not None: 2025-05-07T20:32:22.6570190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.6570524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.6570827Z ) 2025-05-07T20:32:22.6571023Z else: 2025-05-07T20:32:22.6571242Z scale_ub_tensor = None 2025-05-07T20:32:22.6571490Z 2025-05-07T20:32:22.6571725Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6572041Z op = silu_mul_quant 2025-05-07T20:32:22.6572285Z if compiled: 2025-05-07T20:32:22.6572537Z op = torch.compile(op) 2025-05-07T20:32:22.6572837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6573104Z 2025-05-07T20:32:22.6573303Z y_fp8, y_scale = fn() 2025-05-07T20:32:22.6573590Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:22.6573873Z 2025-05-07T20:32:22.6574109Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6574454Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:22.6574741Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:22.6575060Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:22.6575417Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.6575813Z 2025-05-07T20:32:22.6576018Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:22.6576219Z 2025-05-07T20:32:22.6576320Z moe/activation_test.py:126: 2025-05-07T20:32:22.6576625Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6576951Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.6577282Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.6578079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.6578834Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.6579382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.6580218Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.6580912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.6581642Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.6582384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:22.6583128Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.6583854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.6584495Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.6585096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.6585615Z fn() 2025-05-07T20:32:22.6586122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.6586764Z self.fn.run( 
2025-05-07T20:32:22.6587234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.6587772Z kernel = self.compile( 2025-05-07T20:32:22.6588304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.6588956Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.6589350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6589705Z 2025-05-07T20:32:22.6590330Z self = 2025-05-07T20:32:22.6591416Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.6592802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09d57caf0>} 2025-05-07T20:32:22.6594131Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.6595161Z context = 2025-05-07T20:32:22.6595449Z 2025-05-07T20:32:22.6595623Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.6596136Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.6596598Z module_map=module_map) 2025-05-07T20:32:22.6596967Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.6597316Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.6597657Z E ^ 2025-05-07T20:32:22.6598122Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6598564Z 2025-05-07T20:32:22.6598984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.6599487Z 2025-05-07T20:32:22.6599594Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6600003Z self=, 2025-05-07T20:32:22.6600405Z T=2048, 2025-05-07T20:32:22.6600591Z D=5120, 2025-05-07T20:32:22.6600786Z scale_ub=1200.0, 2025-05-07T20:32:22.6601013Z contiguous=True, 2025-05-07T20:32:22.6601229Z compiled=False, 2025-05-07T20:32:22.6601441Z ) 2025-05-07T20:32:24.0065886Z self = 2025-05-07T20:32:24.0066485Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.0066803Z 2025-05-07T20:32:24.0066886Z @given( 2025-05-07T20:32:24.0067135Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.0067466Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.0067783Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.0076189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.0076586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.0076877Z ) 2025-05-07T20:32:24.0077245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.0077698Z def test_silu_mul_quant( 2025-05-07T20:32:24.0077949Z self, 2025-05-07T20:32:24.0078157Z T: int, 2025-05-07T20:32:24.0078365Z D: int, 2025-05-07T20:32:24.0078594Z scale_ub: Optional[float], 2025-05-07T20:32:24.0078873Z contiguous: bool, 2025-05-07T20:32:24.0079120Z compiled: bool, 2025-05-07T20:32:24.0079593Z ) -> None: 2025-05-07T20:32:24.0079831Z torch.manual_seed(2025) 2025-05-07T20:32:24.0080080Z 2025-05-07T20:32:24.0080356Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.0080712Z 
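For reference, the math under test is compact: y = SiLU(x0) * x1 followed by row-wise FP8 quantization. Below is a hedged eager-mode restatement of the ref_fn path shown above, assuming torch.float8_e4m3fn is available (PyTorch >= 2.1) and the common rowwise convention scale = row_max / fp8_max; the helper name silu_mul_quant_ref is hypothetical, and since it avoids Triton it also runs on GPUs without fp8e4nv kernels:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
        row_max = y.abs().amax(dim=1)                            # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)           # optional upper bound
        fp8_max = torch.finfo(torch.float8_e4m3fn).max           # 448.0 for e4m3fn
        y_scale = (row_max / fp8_max).clamp(min=1e-12)           # avoid divide-by-zero
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None] mirrors the check the test performs on the Triton output.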
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(the Triton jit/compile frames are identical to the tracebacks above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:24.0117913Z                 op = torch.compile(op)
2025-05-07T20:32:24.0118215Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:24.0118492Z 
2025-05-07T20:32:24.0118693Z         y_fp8, y_scale = fn()
2025-05-07T20:32:24.0118981Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:24.0119340Z 
2025-05-07T20:32:24.0119589Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:24.0119930Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:24.0120223Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:24.0120544Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:24.0120907Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:24.0121214Z 
2025-05-07T20:32:24.0121424Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:24.0121676Z 
2025-05-07T20:32:24.0121779Z moe/activation_test.py:126: 
2025-05-07T20:32:24.0122126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:24.0122460Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:24.0122793Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:24.0123590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:24.0124337Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:24.0124888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:24.0125567Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:24.0126251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:24.0126970Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:24.0127758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:24.0128526Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:24.0129296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:24.0129943Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:24.0130543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:24.0131064Z     fn()
2025-05-07T20:32:24.0131566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:24.0132146Z     self.fn.run(
2025-05-07T20:32:24.0132618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:24.0133160Z     kernel = self.compile(
2025-05-07T20:32:24.0133699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:24.0134359Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:24.0134757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:24.0134986Z 
2025-05-07T20:32:24.0135197Z self = <...>
2025-05-07T20:32:24.0136270Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:24.0137630Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fd097e2d3f0>}
2025-05-07T20:32:24.0138966Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:24.0140045Z context = <...>
2025-05-07T20:32:24.0140378Z 
2025-05-07T20:32:24.0140548Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:24.0141076Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:24.0141544Z                            module_map=module_map)
2025-05-07T20:32:24.0141913Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:24.0142270Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:24.0142542Z E       ^
2025-05-07T20:32:24.0143009Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:24.0143506Z 
2025-05-07T20:32:24.0143983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
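[Editor's note: every failure in this section has the same root cause. Triton's fp8e4nv type (float8 e4m3) is accepted by its NVIDIA backend only on compute capability 8.9 and newer (Ada/Hopper); the linux.g5.4xlarge runner's A10G is sm_86, where only fp8e4b15 and fp8e5 are offered, so every kernel touching fp8e4nv dies in make_ir as above. A minimal sketch of a capability guard a test like this could use — the helper name is hypothetical, not part of FBGEMM or this log:

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to float8_e4m3fn, which Triton's NVIDIA
        # backend accepts only on compute capability >= (8, 9), i.e. sm_89+.
        # The A10G running this job is sm_86, hence the CompilationError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class ActivationTests(unittest.TestCase):
        ...

With such a guard the job would report these cases as skipped instead of re-compiling and failing the same kernels for every Hypothesis example.]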
[Editor's note: Hypothesis retried the identical test body for each example below, and every attempt failed at Triton compile time with the same error — ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The verbatim-duplicate source listings and tracebacks are elided; each entry keeps the example's parameters and the call path that raised the CompilationError.]
2025-05-07T20:32:24.0144620Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:25.2116232Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
2025-05-07T20:32:25.2156270Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:26.7847492Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:26.7878429Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
2025-05-07T20:32:26.8544122Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:27.2210799Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:27.2247366Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
2025-05-07T20:32:27.8043659Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
2025-05-07T20:32:28.3447908Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
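[Editor's note: Hypothesis keeps drawing new parameter combinations even though each one fails identically at compile time, which is why the same source listing and traceback recur for minutes of log. For local debugging one could pin the known-bad parameters with an explicit example, which Hypothesis always runs in addition to (and before) any generated ones; a sketch using the standard decorators, with the test body elided:

    from hypothesis import Verbosity, example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)  # first failure above
    @settings(verbosity=Verbosity.verbose, max_examples=1, deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body as in the test shown earlier

]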
2025-05-07T20:32:29.2481116Z op = silu_mul_quant 2025-05-07T20:32:29.2481361Z if compiled: 2025-05-07T20:32:29.2481682Z op = torch.compile(op) 2025-05-07T20:32:29.2481980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.2482254Z 2025-05-07T20:32:29.2482444Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.2482729Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.2483019Z 2025-05-07T20:32:29.2483252Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.2483585Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.2483878Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.2484259Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.2484619Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.2484983Z 2025-05-07T20:32:29.2485179Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:29.2485381Z 2025-05-07T20:32:29.2485482Z moe/activation_test.py:126: 2025-05-07T20:32:29.2485786Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.2486123Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.2486444Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.2487231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.2487995Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.2488540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.2489222Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.2490189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.2490930Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.2491754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:29.2492666Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.2493548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.2494318Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.2495031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.2495654Z fn() 2025-05-07T20:32:29.2496262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.2496955Z self.fn.run( 2025-05-07T20:32:29.2497507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.2498140Z kernel = self.compile( 2025-05-07T20:32:29.2498787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.2499568Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.2500073Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.2500297Z 2025-05-07T20:32:29.2500510Z self = 2025-05-07T20:32:29.2501583Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.2503187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09694c700>} 2025-05-07T20:32:29.2504516Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.2505619Z context = 2025-05-07T20:32:29.2505910Z 2025-05-07T20:32:29.2506077Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.2506597Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.2507055Z module_map=module_map) 2025-05-07T20:32:29.2507486Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.2507839Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.2508151Z E ^ 2025-05-07T20:32:29.2508613Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.2509070Z 2025-05-07T20:32:29.2509484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.2509990Z 2025-05-07T20:32:29.2510101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.2510505Z self=, 2025-05-07T20:32:29.2510903Z T=4096, 2025-05-07T20:32:29.2511089Z D=5120, 2025-05-07T20:32:29.2511276Z scale_ub=None, 2025-05-07T20:32:29.2511488Z contiguous=True, 2025-05-07T20:32:29.2511709Z compiled=True, 2025-05-07T20:32:29.2511903Z ) 2025-05-07T20:32:29.9851149Z self = 2025-05-07T20:32:29.9851978Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.9852377Z 2025-05-07T20:32:29.9852502Z @given( 2025-05-07T20:32:29.9852846Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.9853316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.9853772Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.9854443Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.9854946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.9855380Z ) 2025-05-07T20:32:29.9855902Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.9856567Z def test_silu_mul_quant( 2025-05-07T20:32:29.9856931Z self, 2025-05-07T20:32:29.9857220Z T: int, 2025-05-07T20:32:29.9857520Z D: int, 2025-05-07T20:32:29.9857860Z scale_ub: Optional[float], 2025-05-07T20:32:29.9858267Z contiguous: bool, 2025-05-07T20:32:29.9858626Z compiled: bool, 2025-05-07T20:32:29.9858978Z ) -> None: 2025-05-07T20:32:29.9859302Z torch.manual_seed(2025) 2025-05-07T20:32:29.9859581Z 2025-05-07T20:32:29.9859983Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.9860334Z 2025-05-07T20:32:29.9860538Z x_sign = torch.sign(x) 2025-05-07T20:32:29.9867088Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.9867412Z x = x_sign * x_clamp 2025-05-07T20:32:29.9867651Z x0 = x[:, :D] 2025-05-07T20:32:29.9867866Z x1 = x[:, D:] 2025-05-07T20:32:29.9868077Z 2025-05-07T20:32:29.9868257Z if contiguous: 2025-05-07T20:32:29.9868495Z x0 = x0.contiguous() 2025-05-07T20:32:29.9868757Z x1 = x1.contiguous() 2025-05-07T20:32:29.9868994Z 2025-05-07T20:32:29.9869187Z if scale_ub is not None: 2025-05-07T20:32:29.9869495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.9869858Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.9870166Z ) 2025-05-07T20:32:29.9870361Z else: 2025-05-07T20:32:29.9870580Z scale_ub_tensor 
= None 2025-05-07T20:32:29.9870828Z 2025-05-07T20:32:29.9871067Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.9871494Z op = silu_mul_quant 2025-05-07T20:32:29.9871748Z if compiled: 2025-05-07T20:32:29.9872009Z op = torch.compile(op) 2025-05-07T20:32:29.9872313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.9872584Z 2025-05-07T20:32:29.9872783Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.9873072Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.9873357Z 2025-05-07T20:32:29.9873601Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.9873937Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.9874304Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.9874678Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.9875046Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.9875356Z 2025-05-07T20:32:29.9875571Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:29.9875775Z 2025-05-07T20:32:29.9875884Z moe/activation_test.py:126: 2025-05-07T20:32:29.9876193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.9876528Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.9876846Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.9877644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.9878397Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.9878942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.9879621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.9880306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.9881028Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.9881820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:29.9882579Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.9883302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.9883935Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.9884524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.9885043Z fn() 2025-05-07T20:32:29.9885558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.9886134Z self.fn.run( 2025-05-07T20:32:29.9886591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.9887122Z kernel = self.compile( 2025-05-07T20:32:29.9887662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.9888308Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.9888702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.9888932Z 2025-05-07T20:32:29.9889141Z self = 2025-05-07T20:32:29.9890444Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.9891810Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096894280>} 2025-05-07T20:32:29.9893225Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.9894255Z context = 2025-05-07T20:32:29.9894539Z 2025-05-07T20:32:29.9894711Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.9895233Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.9895760Z module_map=module_map) 2025-05-07T20:32:29.9896189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.9896550Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.9896813Z E ^ 2025-05-07T20:32:29.9897275Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.9897722Z 2025-05-07T20:32:29.9898137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.9898655Z 2025-05-07T20:32:29.9898761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.9899172Z self=, 2025-05-07T20:32:29.9899591Z T=16384, 2025-05-07T20:32:29.9899883Z D=5120, 2025-05-07T20:32:29.9900084Z scale_ub=None, 2025-05-07T20:32:29.9900294Z contiguous=True, 2025-05-07T20:32:29.9900523Z compiled=True, 2025-05-07T20:32:29.9900727Z ) 2025-05-07T20:32:30.0284479Z W0507 20:32:30.026000 87987 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:30.0286461Z W0507 20:32:30.026000 87987 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:30.0288439Z W0507 20:32:30.026000 87987 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:30.0289672Z W0507 20:32:30.026000 87987 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:30.0290914Z W0507 20:32:30.026000 87987 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
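Analysis of the recompile warning above (annotation, not part of the captured output): torch._dynamo guards compiled graphs on tensor strides as well as shapes. The test's `contiguous` parameter alternates between sliced views (`x0 = x[:, :D]`, which keeps the parent row stride of 2*D) and `.contiguous()` copies (row stride D), so each flip invalidates the stride guard on `silu_mul_quant` and forces a recompile; after `config.recompile_limit` (8) recompiles, Dynamo falls back to eager, which matches the logged reason "tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240". A minimal sketch of the stride difference, assuming CPU tensors as a stand-in for the test's CUDA tensors:

import torch

D = 5120
x = torch.randn(4, 2 * D)        # stand-in for the test's [T, 2*D] CUDA tensor
x0_view = x[:, :D]               # slice: a view with stride (2*D, 1) = (10240, 1)
x0_copy = x0_view.contiguous()   # copy: fresh storage with stride (D, 1) = (5120, 1)
print(x0_view.stride())          # (10240, 1) -> the "actual" in the guard failure
print(x0_copy.stride())          # (5120, 1)  -> the "expected" in the guard failure
# torch.compile specializes on strides, so alternating these two layouts across
# Hypothesis examples recompiles until config.recompile_limit (8) is exhausted.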
2025-05-07T20:32:30.1315452Z self = 2025-05-07T20:32:30.1315999Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:30.1316276Z 2025-05-07T20:32:30.1316366Z @given( 2025-05-07T20:32:30.1316608Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.1316936Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.1317256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.1317602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.1317930Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.1318221Z ) 2025-05-07T20:32:30.1318582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.1319023Z def test_silu_mul_quant( 2025-05-07T20:32:30.1319276Z self, 2025-05-07T20:32:30.1319479Z T: int, 2025-05-07T20:32:30.1319683Z D: int, 2025-05-07T20:32:30.1319916Z scale_ub: Optional[float], 2025-05-07T20:32:30.1320206Z contiguous: bool, 2025-05-07T20:32:30.1320447Z compiled: bool, 2025-05-07T20:32:30.1320686Z ) -> None: 2025-05-07T20:32:30.1320911Z torch.manual_seed(2025) 2025-05-07T20:32:30.1321155Z 2025-05-07T20:32:30.1321437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.1321889Z 2025-05-07T20:32:30.1322082Z x_sign = torch.sign(x) 2025-05-07T20:32:30.1322388Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.1322707Z x = x_sign * x_clamp 2025-05-07T20:32:30.1322955Z x0 = x[:, :D] 2025-05-07T20:32:30.1323175Z x1 = x[:, D:] 2025-05-07T20:32:30.1323393Z 2025-05-07T20:32:30.1323586Z if contiguous: 2025-05-07T20:32:30.1323820Z x0 = x0.contiguous() 2025-05-07T20:32:30.1324085Z x1 = x1.contiguous() 2025-05-07T20:32:30.1324398Z 2025-05-07T20:32:30.1324599Z if scale_ub is not None: 2025-05-07T20:32:30.1324932Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.1325279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.1325587Z ) 2025-05-07T20:32:30.1325790Z else: 2025-05-07T20:32:30.1326011Z scale_ub_tensor = None 2025-05-07T20:32:30.1326263Z 2025-05-07T20:32:30.1326507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.1326826Z op = silu_mul_quant 2025-05-07T20:32:30.1327079Z if compiled: 2025-05-07T20:32:30.1327335Z op = torch.compile(op) 2025-05-07T20:32:30.1327634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1327907Z 2025-05-07T20:32:30.1328101Z y_fp8, y_scale = fn() 2025-05-07T20:32:30.1328391Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:30.1328676Z 2025-05-07T20:32:30.1328918Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.1329253Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:30.1329550Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:30.1329864Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:30.1330226Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.1330534Z 2025-05-07T20:32:30.1330819Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:30.1331027Z 2025-05-07T20:32:30.1331130Z moe/activation_test.py:126: 2025-05-07T20:32:30.1331432Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1331768Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:30.1332096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.1332888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:30.1333643Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:30.1334197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.1334883Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.1335576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:30.1336304Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.1337047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:30.1337797Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.1338519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:30.1339161Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:30.1339864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:30.1340391Z fn() 2025-05-07T20:32:30.1340901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:30.1341528Z self.fn.run( 2025-05-07T20:32:30.1341998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.1342532Z kernel = self.compile( 2025-05-07T20:32:30.1343075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.1343722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.1344127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1344396Z 2025-05-07T20:32:30.1344611Z self = 2025-05-07T20:32:30.1345760Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.1347146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096894a60>} 2025-05-07T20:32:30.1348478Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.1349497Z context = 2025-05-07T20:32:30.1349782Z 2025-05-07T20:32:30.1349957Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.1350474Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.1350953Z module_map=module_map) 2025-05-07T20:32:30.1351324Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.1351685Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:30.1351947Z E ^ 2025-05-07T20:32:30.1352455Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.1352900Z 2025-05-07T20:32:30.1353323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.1353830Z 2025-05-07T20:32:30.1353945Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.1354347Z self=, 2025-05-07T20:32:30.1354752Z T=1, 2025-05-07T20:32:30.1354940Z D=5120, 2025-05-07T20:32:30.1355136Z scale_ub=1200.0, 2025-05-07T20:32:30.1355361Z contiguous=True, 2025-05-07T20:32:30.1355590Z compiled=True, 2025-05-07T20:32:30.1355795Z ) 2025-05-07T20:32:30.2797689Z self = 2025-05-07T20:32:30.2798231Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:30.2798493Z 2025-05-07T20:32:30.2798594Z @given( 2025-05-07T20:32:30.2798830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.2799159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.2799475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.2799813Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.2800138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.2800430Z ) 2025-05-07T20:32:30.2800786Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.2801225Z def test_silu_mul_quant( 2025-05-07T20:32:30.2801465Z self, 2025-05-07T20:32:30.2801666Z T: int, 2025-05-07T20:32:30.2801867Z D: int, 2025-05-07T20:32:30.2802094Z scale_ub: Optional[float], 2025-05-07T20:32:30.2802379Z contiguous: bool, 2025-05-07T20:32:30.2802624Z compiled: bool, 2025-05-07T20:32:30.2802856Z ) -> None: 2025-05-07T20:32:30.2803081Z torch.manual_seed(2025) 2025-05-07T20:32:30.2803430Z 2025-05-07T20:32:30.2803712Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.2804062Z 2025-05-07T20:32:30.2804258Z x_sign = torch.sign(x) 2025-05-07T20:32:30.2804558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.2804879Z x = x_sign * x_clamp 2025-05-07T20:32:30.2805145Z x0 = x[:, :D] 2025-05-07T20:32:30.2805361Z x1 = x[:, D:] 2025-05-07T20:32:30.2805571Z 2025-05-07T20:32:30.2805757Z if contiguous: 2025-05-07T20:32:30.2806056Z x0 = x0.contiguous() 2025-05-07T20:32:30.2806316Z x1 = x1.contiguous() 2025-05-07T20:32:30.2806554Z 2025-05-07T20:32:30.2806797Z if scale_ub is not None: 2025-05-07T20:32:30.2807076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.2807414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.2807715Z ) 2025-05-07T20:32:30.2807916Z else: 2025-05-07T20:32:30.2808138Z scale_ub_tensor = None 2025-05-07T20:32:30.2808383Z 2025-05-07T20:32:30.2808617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.2808930Z op = silu_mul_quant 2025-05-07T20:32:30.2809179Z if compiled: 2025-05-07T20:32:30.2809431Z op = torch.compile(op) 2025-05-07T20:32:30.2809733Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2810008Z 2025-05-07T20:32:30.2810198Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.2810374Z 2025-05-07T20:32:30.2810476Z moe/activation_test.py:117: 2025-05-07T20:32:30.2810782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2811112Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.2811395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2811959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.2812576Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.2813242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.2813934Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.2814474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.2815146Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.2815808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.2816347Z kernel = self.compile( 2025-05-07T20:32:30.2816885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.2817528Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.2817922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2818148Z 2025-05-07T20:32:30.2818369Z self = 2025-05-07T20:32:30.2819442Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.2820858Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09688f1c0>} 2025-05-07T20:32:30.2822189Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.2823223Z context = 2025-05-07T20:32:30.2823557Z 2025-05-07T20:32:30.2823731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.2824236Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.2824692Z module_map=module_map) 2025-05-07T20:32:30.2825057Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.2825403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.2825654Z E ^ 2025-05-07T20:32:30.2826118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.2826614Z 2025-05-07T20:32:30.2827070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.2827584Z 2025-05-07T20:32:30.2827690Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.2828100Z self=, 2025-05-07T20:32:30.2828504Z T=1, 2025-05-07T20:32:30.2828688Z D=5120, 2025-05-07T20:32:30.2828876Z scale_ub=None, 2025-05-07T20:32:30.2829089Z contiguous=False, 2025-05-07T20:32:30.2829313Z compiled=True, 2025-05-07T20:32:30.2829512Z ) 2025-05-07T20:32:30.3502311Z self = 2025-05-07T20:32:30.3502986Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:30.3503249Z 2025-05-07T20:32:30.3503335Z @given( 2025-05-07T20:32:30.3503576Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.3503899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.3504215Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.3504539Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.3504872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.3505161Z ) 2025-05-07T20:32:30.3505659Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.3506109Z def test_silu_mul_quant( 2025-05-07T20:32:30.3506354Z self, 2025-05-07T20:32:30.3506552Z T: int, 2025-05-07T20:32:30.3506751Z D: int, 2025-05-07T20:32:30.3506975Z scale_ub: Optional[float], 2025-05-07T20:32:30.3507250Z contiguous: bool, 2025-05-07T20:32:30.3507497Z compiled: bool, 2025-05-07T20:32:30.3507731Z ) -> None: 2025-05-07T20:32:30.3507952Z torch.manual_seed(2025) 2025-05-07T20:32:30.3508197Z 2025-05-07T20:32:30.3508472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.3508803Z 2025-05-07T20:32:30.3508992Z x_sign = torch.sign(x) 2025-05-07T20:32:30.3509284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.3509595Z x = x_sign * x_clamp 2025-05-07T20:32:30.3509840Z x0 = x[:, :D] 2025-05-07T20:32:30.3510060Z x1 = x[:, D:] 2025-05-07T20:32:30.3510282Z 2025-05-07T20:32:30.3510471Z if contiguous: 2025-05-07T20:32:30.3510710Z x0 = x0.contiguous() 2025-05-07T20:32:30.3510969Z x1 = x1.contiguous() 2025-05-07T20:32:30.3511203Z 2025-05-07T20:32:30.3511401Z if scale_ub is not None: 2025-05-07T20:32:30.3511681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.3512016Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.3512322Z ) 2025-05-07T20:32:30.3512521Z else: 2025-05-07T20:32:30.3512739Z scale_ub_tensor = None 2025-05-07T20:32:30.3512986Z 2025-05-07T20:32:30.3513229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.3513544Z op = silu_mul_quant 2025-05-07T20:32:30.3513798Z if compiled: 2025-05-07T20:32:30.3514050Z op = torch.compile(op) 2025-05-07T20:32:30.3514353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.3514696Z 2025-05-07T20:32:30.3514900Z y_fp8, y_scale = fn() 2025-05-07T20:32:30.3515194Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:30.3515481Z 2025-05-07T20:32:30.3515720Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.3516053Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:30.3516343Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:30.3516650Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:30.3517009Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.3517381Z 2025-05-07T20:32:30.3517581Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:30.3517776Z 2025-05-07T20:32:30.3517936Z moe/activation_test.py:126: 2025-05-07T20:32:30.3518237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.3518562Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:30.3518887Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.3519678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:30.3520420Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:30.3520956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.3521632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.3522316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:30.3523037Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.3523775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:30.3524555Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.3525283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:30.3525914Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:30.3526507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:30.3527018Z fn() 2025-05-07T20:32:30.3527517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:30.3528085Z self.fn.run( 2025-05-07T20:32:30.3528553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.3529076Z kernel = self.compile( 2025-05-07T20:32:30.3529629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.3530304Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.3530694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.3530916Z 2025-05-07T20:32:30.3531130Z self = 2025-05-07T20:32:30.3532188Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.3533550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd09688e680>} 2025-05-07T20:32:30.3534872Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.3535976Z context = 2025-05-07T20:32:30.3536257Z 2025-05-07T20:32:30.3536429Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.3536939Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.3537401Z module_map=module_map) 2025-05-07T20:32:30.3537761Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.3538110Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:30.3538418Z E ^ 2025-05-07T20:32:30.3538915Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.3539355Z 2025-05-07T20:32:30.3539894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.3540399Z 2025-05-07T20:32:30.3540508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.3540918Z self=, 2025-05-07T20:32:30.3541495Z T=1, 2025-05-07T20:32:30.3541674Z D=5120, 2025-05-07T20:32:30.3541870Z scale_ub=None, 2025-05-07T20:32:30.3542086Z contiguous=True, 2025-05-07T20:32:30.3542306Z compiled=False, 2025-05-07T20:32:30.3542510Z ) 2025-05-07T20:32:30.6804580Z self = 2025-05-07T20:32:30.6812091Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:30.6812502Z 2025-05-07T20:32:30.6812618Z @given( 2025-05-07T20:32:30.6812963Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.6813416Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.6813780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.6814119Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.6814578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.6814870Z ) 2025-05-07T20:32:30.6815226Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.6815669Z def test_silu_mul_quant( 2025-05-07T20:32:30.6815911Z self, 2025-05-07T20:32:30.6816105Z T: int, 2025-05-07T20:32:30.6816308Z D: int, 2025-05-07T20:32:30.6816531Z scale_ub: Optional[float], 2025-05-07T20:32:30.6816802Z contiguous: bool, 2025-05-07T20:32:30.6817053Z compiled: bool, 2025-05-07T20:32:30.6817283Z ) -> None: 2025-05-07T20:32:30.6817495Z torch.manual_seed(2025) 2025-05-07T20:32:30.6817742Z 2025-05-07T20:32:30.6818027Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.6818363Z 2025-05-07T20:32:30.6818557Z x_sign = torch.sign(x) 2025-05-07T20:32:30.6818851Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.6819152Z x = x_sign * x_clamp 2025-05-07T20:32:30.6819398Z x0 = x[:, :D] 2025-05-07T20:32:30.6819638Z x1 = x[:, D:] 2025-05-07T20:32:30.6819955Z 2025-05-07T20:32:30.6820146Z if contiguous: 2025-05-07T20:32:30.6820377Z x0 = x0.contiguous() 2025-05-07T20:32:30.6820628Z x1 = x1.contiguous() 2025-05-07T20:32:30.6820868Z 2025-05-07T20:32:30.6821067Z if scale_ub is not None: 2025-05-07T20:32:30.6821339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.6821665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.6821973Z ) 2025-05-07T20:32:30.6822167Z else: 2025-05-07T20:32:30.6822379Z scale_ub_tensor = None 2025-05-07T20:32:30.6822634Z 2025-05-07T20:32:30.6822866Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.6823169Z op = silu_mul_quant 2025-05-07T20:32:30.6823419Z if compiled: 2025-05-07T20:32:30.6823671Z 
op = torch.compile(op) 2025-05-07T20:32:30.6824034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.6824311Z 2025-05-07T20:32:30.6824504Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.6824671Z 2025-05-07T20:32:30.6824775Z moe/activation_test.py:117: 2025-05-07T20:32:30.6825069Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.6825405Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.6825686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.6826368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.6827186Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.6827760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.6828441Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.6829099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.6829632Z kernel = self.compile( 2025-05-07T20:32:30.6830172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.6830817Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.6831206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.6831440Z 2025-05-07T20:32:30.6831652Z self = 2025-05-07T20:32:30.6832725Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.6834129Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbd900>} 2025-05-07T20:32:30.6835482Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.6836489Z context = 2025-05-07T20:32:30.6836776Z 2025-05-07T20:32:30.6836940Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.6837463Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.6837922Z module_map=module_map) 2025-05-07T20:32:30.6838290Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.6838645Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.6838898Z E ^ 2025-05-07T20:32:30.6839357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.6839799Z 2025-05-07T20:32:30.6840215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.6840723Z 2025-05-07T20:32:30.6840829Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.6841238Z self=, 2025-05-07T20:32:30.6841624Z T=128, 2025-05-07T20:32:30.6841808Z D=5120, 2025-05-07T20:32:30.6842003Z scale_ub=None, 2025-05-07T20:32:30.6842212Z contiguous=False, 2025-05-07T20:32:30.6842434Z compiled=True, 2025-05-07T20:32:30.6842638Z ) 2025-05-07T20:32:30.6842953Z self = 2025-05-07T20:32:30.6843440Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:30.6843714Z 2025-05-07T20:32:30.6843787Z @given( 2025-05-07T20:32:30.6844066Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.6844368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.6844669Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.6844995Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.6845312Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.6845592Z ) 2025-05-07T20:32:30.6845933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.6846367Z def test_silu_mul_quant( 2025-05-07T20:32:30.6846647Z self, 2025-05-07T20:32:30.6846839Z T: int, 2025-05-07T20:32:30.6847035Z D: int, 2025-05-07T20:32:30.6847286Z scale_ub: Optional[float], 2025-05-07T20:32:30.6847560Z contiguous: bool, 2025-05-07T20:32:30.6847794Z compiled: bool, 2025-05-07T20:32:30.6848014Z ) -> None: 2025-05-07T20:32:30.6848229Z torch.manual_seed(2025) 2025-05-07T20:32:30.6848470Z 2025-05-07T20:32:30.6848736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.6849069Z 2025-05-07T20:32:30.6849260Z x_sign = torch.sign(x) 2025-05-07T20:32:30.6849548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.6849899Z x = x_sign * x_clamp 2025-05-07T20:32:30.6850137Z x0 = x[:, :D] 2025-05-07T20:32:30.6850347Z x1 = x[:, D:] 2025-05-07T20:32:30.6850552Z 2025-05-07T20:32:30.6850740Z if contiguous: 2025-05-07T20:32:30.6850965Z x0 = x0.contiguous() 2025-05-07T20:32:30.6851225Z x1 = x1.contiguous() 2025-05-07T20:32:30.6851458Z 2025-05-07T20:32:30.6851652Z if scale_ub is not None: 2025-05-07T20:32:30.6851925Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.6852260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.6852558Z ) 2025-05-07T20:32:30.6852741Z else: 2025-05-07T20:32:30.6852999Z scale_ub_tensor = None 2025-05-07T20:32:30.6853245Z 2025-05-07T20:32:30.6853467Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.6853778Z op = silu_mul_quant 2025-05-07T20:32:30.6854030Z if compiled: 2025-05-07T20:32:30.6854274Z op = torch.compile(op) 2025-05-07T20:32:30.6854569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.6854839Z 2025-05-07T20:32:30.6855025Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.6855193Z 2025-05-07T20:32:30.6855295Z moe/activation_test.py:117: 2025-05-07T20:32:30.6855586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.6855920Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.6856191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.6856741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.6857292Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.6857944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.6858623Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.6859153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.6859924Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.6860579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.6861109Z kernel = self.compile( 2025-05-07T20:32:30.6861650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.6862288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.6862678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.6862965Z 2025-05-07T20:32:30.6863173Z self = 2025-05-07T20:32:30.6864236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.6865602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbfeb0>} 2025-05-07T20:32:30.6867014Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.6868026Z context = 2025-05-07T20:32:30.6868320Z 2025-05-07T20:32:30.6868487Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.6869004Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.6869457Z module_map=module_map) 2025-05-07T20:32:30.6869872Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.6870218Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.6870466Z E ^ 2025-05-07T20:32:30.6870924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.6871375Z 2025-05-07T20:32:30.6871788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.6872305Z 2025-05-07T20:32:30.6872414Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.6872815Z self=, 2025-05-07T20:32:30.6873212Z T=128, 2025-05-07T20:32:30.6873438Z D=7168, 2025-05-07T20:32:30.6873627Z scale_ub=1200.0, 2025-05-07T20:32:30.6873847Z contiguous=False, 2025-05-07T20:32:30.6874068Z compiled=False, 2025-05-07T20:32:30.6874262Z ) 2025-05-07T20:32:30.8126182Z self = 2025-05-07T20:32:30.8127310Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:30.8127885Z 2025-05-07T20:32:30.8128047Z @given( 2025-05-07T20:32:30.8128522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.8129032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.8129488Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.8129816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.8130149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.8130431Z ) 2025-05-07T20:32:30.8130785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.8131228Z def test_silu_mul_quant( 2025-05-07T20:32:30.8131469Z self, 2025-05-07T20:32:30.8131669Z T: int, 2025-05-07T20:32:30.8131875Z D: int, 2025-05-07T20:32:30.8132092Z scale_ub: Optional[float], 2025-05-07T20:32:30.8132365Z contiguous: bool, 2025-05-07T20:32:30.8132610Z compiled: bool, 2025-05-07T20:32:30.8132836Z ) -> None: 2025-05-07T20:32:30.8133059Z torch.manual_seed(2025) 2025-05-07T20:32:30.8133294Z 2025-05-07T20:32:30.8133572Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.8133903Z 2025-05-07T20:32:30.8134106Z x_sign = torch.sign(x) 2025-05-07T20:32:30.8134403Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.8134716Z x = x_sign * x_clamp 2025-05-07T20:32:30.8134962Z x0 = x[:, :D] 2025-05-07T20:32:30.8135181Z x1 = x[:, D:] 2025-05-07T20:32:30.8135510Z 2025-05-07T20:32:30.8135696Z if contiguous: 2025-05-07T20:32:30.8135931Z x0 = x0.contiguous() 2025-05-07T20:32:30.8136194Z x1 = x1.contiguous() 2025-05-07T20:32:30.8136427Z 2025-05-07T20:32:30.8136620Z if scale_ub is not None: 2025-05-07T20:32:30.8136891Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.8137222Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.8137532Z ) 2025-05-07T20:32:30.8137728Z else: 2025-05-07T20:32:30.8137934Z scale_ub_tensor = None 2025-05-07T20:32:30.8138288Z 2025-05-07T20:32:30.8138524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.8138893Z op = silu_mul_quant 2025-05-07T20:32:30.8139147Z if compiled: 2025-05-07T20:32:30.8139400Z op = torch.compile(op) 2025-05-07T20:32:30.8139694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.8140037Z 2025-05-07T20:32:30.8140244Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.8140413Z 2025-05-07T20:32:30.8140532Z moe/activation_test.py:117: 2025-05-07T20:32:30.8140828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8141167Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.8141455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.8142143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.8142839Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.8143374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.8144056Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.8144712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.8145239Z kernel = self.compile( 2025-05-07T20:32:30.8145844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.8146502Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.8146899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8147130Z 2025-05-07T20:32:30.8147346Z self = 2025-05-07T20:32:30.8148422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.8149793Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbd7e0>} 2025-05-07T20:32:30.8151140Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.8152156Z context = 2025-05-07T20:32:30.8152448Z 2025-05-07T20:32:30.8152616Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.8153144Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.8153612Z module_map=module_map) 2025-05-07T20:32:30.8153975Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.8154331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.8154582Z E ^ 2025-05-07T20:32:30.8155046Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.8155539Z 2025-05-07T20:32:30.8155954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.8156460Z 2025-05-07T20:32:30.8156571Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.8156978Z self=, 2025-05-07T20:32:30.8157376Z T=128, 2025-05-07T20:32:30.8157563Z D=5120, 2025-05-07T20:32:30.8157752Z scale_ub=None, 2025-05-07T20:32:30.8157975Z contiguous=False, 2025-05-07T20:32:30.8158204Z compiled=False, 2025-05-07T20:32:30.8158451Z ) 2025-05-07T20:32:30.8158773Z self = 2025-05-07T20:32:30.8159325Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:30.8159596Z 2025-05-07T20:32:30.8159681Z @given( 2025-05-07T20:32:30.8159912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.8160230Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.8160543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.8160868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.8161194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.8161474Z ) 2025-05-07T20:32:30.8161811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.8162245Z def test_silu_mul_quant( 2025-05-07T20:32:30.8162484Z self, 2025-05-07T20:32:30.8162672Z T: int, 2025-05-07T20:32:30.8162873Z D: int, 2025-05-07T20:32:30.8163089Z scale_ub: Optional[float], 2025-05-07T20:32:30.8163360Z contiguous: bool, 2025-05-07T20:32:30.8163606Z compiled: bool, 2025-05-07T20:32:30.8163830Z ) -> None: 2025-05-07T20:32:30.8164045Z torch.manual_seed(2025) 2025-05-07T20:32:30.8164279Z 2025-05-07T20:32:30.8164546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.8164883Z 2025-05-07T20:32:30.8165122Z x_sign = torch.sign(x) 2025-05-07T20:32:30.8165412Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.8165721Z x = x_sign * x_clamp 2025-05-07T20:32:30.8165955Z x0 = x[:, :D] 2025-05-07T20:32:30.8166175Z x1 = x[:, D:] 2025-05-07T20:32:30.8166379Z 2025-05-07T20:32:30.8166556Z if contiguous: 2025-05-07T20:32:30.8166782Z x0 = x0.contiguous() 2025-05-07T20:32:30.8167036Z x1 = x1.contiguous() 2025-05-07T20:32:30.8167264Z 2025-05-07T20:32:30.8167457Z if scale_ub is not None: 2025-05-07T20:32:30.8167726Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.8168058Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.8168362Z ) 2025-05-07T20:32:30.8168552Z else: 2025-05-07T20:32:30.8168763Z scale_ub_tensor = None 2025-05-07T20:32:30.8169011Z 2025-05-07T20:32:30.8169239Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.8169559Z op = silu_mul_quant 2025-05-07T20:32:30.8169807Z if compiled: 2025-05-07T20:32:30.8170056Z op = torch.compile(op) 2025-05-07T20:32:30.8170347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.8170611Z 2025-05-07T20:32:30.8170802Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.8170983Z 2025-05-07T20:32:30.8171082Z moe/activation_test.py:117: 2025-05-07T20:32:30.8171370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8171695Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.8171973Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.8172655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.8173331Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.8173866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.8174598Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.8175252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.8175778Z kernel = self.compile( 2025-05-07T20:32:30.8176311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.8176953Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.8177387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8177645Z 2025-05-07T20:32:30.8177855Z self = 2025-05-07T20:32:30.8178920Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.8180358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09696beb0>} 2025-05-07T20:32:30.8181685Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.8182693Z context = 2025-05-07T20:32:30.8182982Z 2025-05-07T20:32:30.8183151Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.8183673Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.8184136Z module_map=module_map) 2025-05-07T20:32:30.8184543Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.8184899Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.8185159Z E ^ 2025-05-07T20:32:30.8185613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.8186056Z 2025-05-07T20:32:30.8186467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.8186977Z 2025-05-07T20:32:30.8187082Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.8187495Z self=, 2025-05-07T20:32:30.8187887Z T=128, 2025-05-07T20:32:30.8188077Z D=5120, 2025-05-07T20:32:30.8188271Z scale_ub=1200.0, 2025-05-07T20:32:30.8188491Z contiguous=True, 2025-05-07T20:32:30.8188714Z compiled=False, 2025-05-07T20:32:30.8188912Z ) 2025-05-07T20:32:31.0110889Z self = 2025-05-07T20:32:31.0111668Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:31.0112046Z 2025-05-07T20:32:31.0112167Z @given( 2025-05-07T20:32:31.0112455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.0112777Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.0113090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.0113420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.0113756Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.0114044Z ) 2025-05-07T20:32:31.0114389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.0114830Z def test_silu_mul_quant( 2025-05-07T20:32:31.0115067Z self, 2025-05-07T20:32:31.0115256Z T: int, 2025-05-07T20:32:31.0115459Z D: int, 2025-05-07T20:32:31.0115688Z scale_ub: Optional[float], 2025-05-07T20:32:31.0116070Z contiguous: bool, 2025-05-07T20:32:31.0116309Z compiled: bool, 2025-05-07T20:32:31.0116536Z ) -> None: 2025-05-07T20:32:31.0116758Z torch.manual_seed(2025) 2025-05-07T20:32:31.0116990Z 2025-05-07T20:32:31.0117271Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.0117611Z 2025-05-07T20:32:31.0117801Z x_sign = torch.sign(x) 2025-05-07T20:32:31.0118095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.0118403Z x = x_sign * x_clamp 2025-05-07T20:32:31.0118710Z x0 = x[:, :D] 2025-05-07T20:32:31.0118931Z x1 = x[:, D:] 2025-05-07T20:32:31.0119140Z 2025-05-07T20:32:31.0119378Z if contiguous: 2025-05-07T20:32:31.0119617Z x0 = x0.contiguous() 2025-05-07T20:32:31.0119877Z x1 = x1.contiguous() 2025-05-07T20:32:31.0120109Z 2025-05-07T20:32:31.0120308Z if scale_ub is not None: 2025-05-07T20:32:31.0120587Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.0120934Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.0121242Z ) 2025-05-07T20:32:31.0121439Z else: 2025-05-07T20:32:31.0121659Z scale_ub_tensor = None 2025-05-07T20:32:31.0121903Z 2025-05-07T20:32:31.0122144Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.0122464Z op = silu_mul_quant 2025-05-07T20:32:31.0122716Z if compiled: 2025-05-07T20:32:31.0122971Z op = torch.compile(op) 2025-05-07T20:32:31.0123271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.0123538Z 2025-05-07T20:32:31.0123741Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.0123913Z 2025-05-07T20:32:31.0124027Z moe/activation_test.py:117: 2025-05-07T20:32:31.0124315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.0124649Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.0124944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.0125698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.0126391Z 
Trying example: test_silu_mul_quant(
self=,
T=1,
D=7168,
scale_ub=None,
contiguous=False,
compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

Here fn() returned, and the failure surfaced in the reference path instead (test body otherwise identical to the listing above):

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:31.4201245Z 2025-05-07T20:32:31.4201345Z moe/activation_test.py:126: 2025-05-07T20:32:31.4201643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.4201966Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:31.4202353Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.4203144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:31.4203885Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:31.4204427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.4205104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.4205782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:31.4206495Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.4207240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:31.4207983Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.4208708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:31.4209333Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:31.4209936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:31.4210450Z fn() 2025-05-07T20:32:31.4210946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:31.4211524Z self.fn.run( 2025-05-07T20:32:31.4211993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.4212522Z kernel = self.compile( 2025-05-07T20:32:31.4213055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.4213761Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.4214158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.4214382Z 2025-05-07T20:32:31.4214590Z self = 2025-05-07T20:32:31.4215661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.4217110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd0965a2ef0>} 2025-05-07T20:32:31.4218433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.4219458Z context = 2025-05-07T20:32:31.4219740Z 2025-05-07T20:32:31.4219978Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.4220496Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.4220961Z module_map=module_map) 2025-05-07T20:32:31.4221327Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.4221680Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:31.4221946Z E ^ 2025-05-07T20:32:31.4222413Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.4222856Z 2025-05-07T20:32:31.4223272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
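So the reference path dies the same way: _kernel_quantize_fp8_row also emits fp8e4nv. For orientation, a rough eager-mode equivalent of the row-wise fp8 quantization being exercised (a sketch only; the function name and clamping details are assumptions, not FBGEMM's triton_quantize_fp8_row):

    from typing import Optional, Tuple

    import torch

    def rowwise_quantize_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row by its max-abs so it fits float8_e4m3fn (max ~448).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        row_max = torch.clamp(row_max, min=1e-12)  # guard against all-zero rows
        scale = row_max / fp8_max  # per-row dequantization scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

This matches how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]), which is why both the fused kernel and the reference must compile on the same device before any outputs can be compared.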
2025-05-07T20:32:31.5912996Z op = torch.compile(op) 2025-05-07T20:32:31.5913290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.5913564Z 2025-05-07T20:32:31.5913760Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.5913935Z 2025-05-07T20:32:31.5914044Z moe/activation_test.py:117: 2025-05-07T20:32:31.5914346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5914681Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.5914964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.5915526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:31.5916090Z return fn(*args, **kwargs) 2025-05-07T20:32:31.5916747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.5917432Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.5918036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.5918723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.5919378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.5919915Z kernel = self.compile( 2025-05-07T20:32:31.5920461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.5921114Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.5921509Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5921746Z 2025-05-07T20:32:31.5921955Z self = 2025-05-07T20:32:31.5923030Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.5924404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0965a3eb0>} 2025-05-07T20:32:31.5925731Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.5926761Z context = 2025-05-07T20:32:31.5927053Z 2025-05-07T20:32:31.5927227Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.5933209Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.5933737Z module_map=module_map) 2025-05-07T20:32:31.5934195Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.5934556Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.5934819Z E ^ 2025-05-07T20:32:31.5935298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.5935755Z 2025-05-07T20:32:31.5936179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.5936700Z 2025-05-07T20:32:31.5936860Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.5937281Z self=, 2025-05-07T20:32:31.5937727Z T=1, 2025-05-07T20:32:31.5937926Z D=5120, 2025-05-07T20:32:31.5938124Z scale_ub=1200.0, 2025-05-07T20:32:31.5938352Z contiguous=False, 2025-05-07T20:32:31.5938585Z compiled=False, 2025-05-07T20:32:31.5938789Z ) 2025-05-07T20:32:31.5939116Z self = 2025-05-07T20:32:31.5939613Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:31.5939943Z 2025-05-07T20:32:31.5940035Z @given( 2025-05-07T20:32:31.5940270Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.5940591Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.5940909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.5941238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.5941579Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.5941869Z ) 2025-05-07T20:32:31.5942225Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.5942671Z def test_silu_mul_quant( 2025-05-07T20:32:31.5942917Z self, 2025-05-07T20:32:31.5943119Z T: int, 2025-05-07T20:32:31.5943322Z D: int, 2025-05-07T20:32:31.5943555Z scale_ub: Optional[float], 2025-05-07T20:32:31.5943877Z contiguous: bool, 2025-05-07T20:32:31.5944128Z compiled: bool, 2025-05-07T20:32:31.5944362Z ) -> None: 2025-05-07T20:32:31.5944585Z torch.manual_seed(2025) 2025-05-07T20:32:31.5944834Z 2025-05-07T20:32:31.5945107Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.5945455Z 2025-05-07T20:32:31.5945650Z x_sign = torch.sign(x) 2025-05-07T20:32:31.5945946Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.5946257Z x = x_sign * x_clamp 2025-05-07T20:32:31.5946504Z x0 = x[:, :D] 2025-05-07T20:32:31.5946725Z x1 = x[:, D:] 2025-05-07T20:32:31.5946940Z 2025-05-07T20:32:31.5947127Z if contiguous: 2025-05-07T20:32:31.5947364Z x0 = x0.contiguous() 2025-05-07T20:32:31.5947624Z x1 = x1.contiguous() 2025-05-07T20:32:31.5947866Z 2025-05-07T20:32:31.5948064Z if scale_ub is not None: 2025-05-07T20:32:31.5948352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.5948693Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.5949003Z ) 2025-05-07T20:32:31.5949204Z else: 2025-05-07T20:32:31.5949421Z scale_ub_tensor = None 2025-05-07T20:32:31.5949675Z 2025-05-07T20:32:31.5949910Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.5950226Z op = silu_mul_quant 2025-05-07T20:32:31.5950483Z if compiled: 2025-05-07T20:32:31.5950738Z op = torch.compile(op) 2025-05-07T20:32:31.5951043Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.5951317Z 2025-05-07T20:32:31.5951520Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.5951684Z 2025-05-07T20:32:31.5951792Z moe/activation_test.py:117: 2025-05-07T20:32:31.5952084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5952419Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.5952769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.5953476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.5954166Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.5954711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.5955392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.5956110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.5956683Z kernel = self.compile( 2025-05-07T20:32:31.5957234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.5957889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.5958295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5958522Z 2025-05-07T20:32:31.5958731Z self = 2025-05-07T20:32:31.5959823Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.5961213Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09696be20>} 2025-05-07T20:32:31.5962556Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.5963572Z context = 2025-05-07T20:32:31.5963904Z 2025-05-07T20:32:31.5964085Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.5964622Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.5965089Z module_map=module_map) 2025-05-07T20:32:31.5965459Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.5965814Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.5966072Z E ^ 2025-05-07T20:32:31.5966538Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.5966984Z 2025-05-07T20:32:31.5967411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.5967922Z 2025-05-07T20:32:31.5968030Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.5968445Z self=, 2025-05-07T20:32:31.5968848Z T=16384, 2025-05-07T20:32:31.5969042Z D=5120, 2025-05-07T20:32:31.5969238Z scale_ub=1200.0, 2025-05-07T20:32:31.5969466Z contiguous=False, 2025-05-07T20:32:31.5969698Z compiled=True, 2025-05-07T20:32:31.5969926Z ) 2025-05-07T20:32:31.6964154Z self = 2025-05-07T20:32:31.6965307Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.6965738Z 2025-05-07T20:32:31.6965861Z @given( 2025-05-07T20:32:31.6966220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.6966688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.6967147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.6967639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.6968127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.6968695Z ) 2025-05-07T20:32:31.6969215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.6969862Z def test_silu_mul_quant( 2025-05-07T20:32:31.6970097Z self, 2025-05-07T20:32:31.6970301Z T: int, 2025-05-07T20:32:31.6970503Z D: int, 2025-05-07T20:32:31.6970721Z scale_ub: Optional[float], 2025-05-07T20:32:31.6971002Z contiguous: bool, 2025-05-07T20:32:31.6971245Z compiled: bool, 2025-05-07T20:32:31.6971466Z ) -> None: 2025-05-07T20:32:31.6971679Z torch.manual_seed(2025) 2025-05-07T20:32:31.6971993Z 2025-05-07T20:32:31.6972265Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.6972667Z 2025-05-07T20:32:31.6972864Z x_sign = torch.sign(x) 2025-05-07T20:32:31.6973150Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.6973471Z x = x_sign * x_clamp 2025-05-07T20:32:31.6973717Z x0 = x[:, :D] 2025-05-07T20:32:31.6973943Z x1 = x[:, D:] 2025-05-07T20:32:31.6974150Z 2025-05-07T20:32:31.6974339Z if contiguous: 2025-05-07T20:32:31.6974574Z x0 = x0.contiguous() 2025-05-07T20:32:31.6974840Z x1 = x1.contiguous() 2025-05-07T20:32:31.6975080Z 2025-05-07T20:32:31.6975273Z if scale_ub is not None: 2025-05-07T20:32:31.6975543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.6975874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.6976185Z ) 2025-05-07T20:32:31.6976371Z else: 2025-05-07T20:32:31.6976592Z scale_ub_tensor = None 2025-05-07T20:32:31.6976838Z 2025-05-07T20:32:31.6977071Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.6977386Z op = silu_mul_quant 2025-05-07T20:32:31.6977645Z if compiled: 2025-05-07T20:32:31.6977896Z op = torch.compile(op) 2025-05-07T20:32:31.6978189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.6978470Z 2025-05-07T20:32:31.6978731Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.6978903Z 2025-05-07T20:32:31.6979006Z moe/activation_test.py:117: 2025-05-07T20:32:31.6979312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.6979645Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.6979986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.6980548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:31.6981110Z return fn(*args, **kwargs) 
2025-05-07T20:32:31.6981762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.6982442Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.6982975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.6983662Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.6984320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.6984844Z kernel = self.compile( 2025-05-07T20:32:31.6985383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.6986031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.6986418Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.6986651Z 2025-05-07T20:32:31.6986861Z self = 2025-05-07T20:32:31.6987925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.6989343Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1a0c8b0>} 2025-05-07T20:32:31.6990836Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.6991866Z context = 2025-05-07T20:32:31.6992228Z 2025-05-07T20:32:31.6992397Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.6992992Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.6993448Z module_map=module_map) 2025-05-07T20:32:31.6993812Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.6994166Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.6994421Z E ^ 2025-05-07T20:32:31.6994888Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.6995335Z 2025-05-07T20:32:31.6995751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.6996259Z 2025-05-07T20:32:31.6996369Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.6996776Z self=, 2025-05-07T20:32:31.6997181Z T=2048, 2025-05-07T20:32:31.6997372Z D=7168, 2025-05-07T20:32:31.6997560Z scale_ub=1200.0, 2025-05-07T20:32:31.6997788Z contiguous=False, 2025-05-07T20:32:31.6998017Z compiled=True, 2025-05-07T20:32:31.6998216Z ) 2025-05-07T20:32:31.6998528Z self = 2025-05-07T20:32:31.6999083Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.6999356Z 2025-05-07T20:32:31.6999439Z @given( 2025-05-07T20:32:31.6999667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.6999979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.7000288Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.7000610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.7000944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.7001235Z ) 2025-05-07T20:32:31.7001590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.7002025Z def test_silu_mul_quant( 2025-05-07T20:32:31.7002272Z self, 2025-05-07T20:32:31.7002466Z T: int, 2025-05-07T20:32:31.7002661Z D: int, 2025-05-07T20:32:31.7002882Z scale_ub: Optional[float], 2025-05-07T20:32:31.7003154Z contiguous: bool, 2025-05-07T20:32:31.7003386Z compiled: bool, 2025-05-07T20:32:31.7003614Z ) -> None: 2025-05-07T20:32:31.7003833Z torch.manual_seed(2025) 2025-05-07T20:32:31.7004072Z 2025-05-07T20:32:31.7004342Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.7004685Z 2025-05-07T20:32:31.7004878Z x_sign = torch.sign(x) 2025-05-07T20:32:31.7005165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.7005474Z x = x_sign * x_clamp 2025-05-07T20:32:31.7005716Z x0 = x[:, :D] 2025-05-07T20:32:31.7005930Z x1 = x[:, D:] 2025-05-07T20:32:31.7006139Z 2025-05-07T20:32:31.7006324Z if contiguous: 2025-05-07T20:32:31.7006554Z x0 = x0.contiguous() 2025-05-07T20:32:31.7006818Z x1 = x1.contiguous() 2025-05-07T20:32:31.7007063Z 2025-05-07T20:32:31.7007253Z if scale_ub is not None: 2025-05-07T20:32:31.7007529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.7007872Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.7008244Z ) 2025-05-07T20:32:31.7008437Z else: 2025-05-07T20:32:31.7008648Z scale_ub_tensor = None 2025-05-07T20:32:31.7008895Z 2025-05-07T20:32:31.7009125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.7009441Z op = silu_mul_quant 2025-05-07T20:32:31.7009692Z if compiled: 2025-05-07T20:32:31.7009941Z op = torch.compile(op) 2025-05-07T20:32:31.7010241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.7010554Z 2025-05-07T20:32:31.7010746Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.7010914Z 2025-05-07T20:32:31.7011013Z moe/activation_test.py:117: 2025-05-07T20:32:31.7011342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.7011674Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.7011955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.7012511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:31.7013082Z return fn(*args, **kwargs) 
2025-05-07T20:32:31.7013735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.7014415Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.7014956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.7015628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.7016298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.7016829Z kernel = self.compile( 2025-05-07T20:32:31.7017370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.7018061Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.7018466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.7018690Z 2025-05-07T20:32:31.7018901Z self = 2025-05-07T20:32:31.7020040Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.7021398Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1a0d090>} 2025-05-07T20:32:31.7022731Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.7023767Z context = 2025-05-07T20:32:31.7024054Z 2025-05-07T20:32:31.7024230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.7024746Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.7025219Z module_map=module_map) 2025-05-07T20:32:31.7025588Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.7025944Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.7026199Z E ^ 2025-05-07T20:32:31.7026659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.7027102Z 2025-05-07T20:32:31.7027519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.7028023Z 2025-05-07T20:32:31.8317441Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8318068Z self=, 2025-05-07T20:32:31.8318479Z T=1, 2025-05-07T20:32:31.8318666Z D=5120, 2025-05-07T20:32:31.8318859Z scale_ub=None, 2025-05-07T20:32:31.8319073Z contiguous=False, 2025-05-07T20:32:31.8319308Z compiled=False, 2025-05-07T20:32:31.8319520Z ) 2025-05-07T20:32:31.8319834Z self = 2025-05-07T20:32:31.8320321Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:31.8320661Z 2025-05-07T20:32:31.8320744Z @given( 2025-05-07T20:32:31.8321030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.8321339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.8321648Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.8321977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.8322302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.8322593Z ) 2025-05-07T20:32:31.8322938Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.8323381Z def test_silu_mul_quant( 2025-05-07T20:32:31.8323621Z self, 2025-05-07T20:32:31.8323818Z T: int, 2025-05-07T20:32:31.8324017Z D: int, 2025-05-07T20:32:31.8324241Z scale_ub: Optional[float], 2025-05-07T20:32:31.8324515Z contiguous: bool, 2025-05-07T20:32:31.8324753Z compiled: bool, 2025-05-07T20:32:31.8324971Z ) -> None: 2025-05-07T20:32:31.8325187Z torch.manual_seed(2025) 2025-05-07T20:32:31.8325422Z 2025-05-07T20:32:31.8325701Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.8326043Z 2025-05-07T20:32:31.8326237Z x_sign = torch.sign(x) 2025-05-07T20:32:31.8326527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.8326839Z x = x_sign * x_clamp 2025-05-07T20:32:31.8327150Z x0 = x[:, :D] 2025-05-07T20:32:31.8327371Z x1 = x[:, D:] 2025-05-07T20:32:31.8327574Z 2025-05-07T20:32:31.8327752Z if contiguous: 2025-05-07T20:32:31.8327976Z x0 = x0.contiguous() 2025-05-07T20:32:31.8328233Z x1 = x1.contiguous() 2025-05-07T20:32:31.8328472Z 2025-05-07T20:32:31.8328658Z if scale_ub is not None: 2025-05-07T20:32:31.8328936Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.8329271Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.8329578Z ) 2025-05-07T20:32:31.8329772Z else: 2025-05-07T20:32:31.8330006Z scale_ub_tensor = None 2025-05-07T20:32:31.8330279Z 2025-05-07T20:32:31.8330511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8330816Z op = silu_mul_quant 2025-05-07T20:32:31.8331060Z if compiled: 2025-05-07T20:32:31.8331299Z op = torch.compile(op) 2025-05-07T20:32:31.8331602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8331872Z 2025-05-07T20:32:31.8332056Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.8332221Z 2025-05-07T20:32:31.8332325Z moe/activation_test.py:117: 2025-05-07T20:32:31.8332626Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8332950Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.8333232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8333927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.8334630Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.8335168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.8335853Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.8336606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.8337138Z kernel = self.compile( 2025-05-07T20:32:31.8337681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.8338336Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.8338736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8338962Z 2025-05-07T20:32:31.8339173Z self = 2025-05-07T20:32:31.8340407Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.8341804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1a0d7e0>} 2025-05-07T20:32:31.8343157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.8344181Z context = 2025-05-07T20:32:31.8344465Z 2025-05-07T20:32:31.8344630Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.8345157Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.8345628Z module_map=module_map) 2025-05-07T20:32:31.8345986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.8346344Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.8346606Z E ^ 2025-05-07T20:32:31.8347120Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8347574Z 2025-05-07T20:32:31.8347987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.8348505Z 2025-05-07T20:32:31.8348607Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8349014Z self=, 2025-05-07T20:32:31.8349398Z T=4096, 2025-05-07T20:32:31.8349580Z D=7168, 2025-05-07T20:32:31.8349774Z scale_ub=1200.0, 2025-05-07T20:32:31.8349992Z contiguous=False, 2025-05-07T20:32:31.8350217Z compiled=False, 2025-05-07T20:32:31.8350420Z ) 2025-05-07T20:32:31.8350734Z self = 2025-05-07T20:32:31.8351212Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:31.8351486Z 2025-05-07T20:32:31.8351559Z @given( 2025-05-07T20:32:31.8351789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.8352090Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.8352395Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.8352715Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.8353030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.8353336Z ) 2025-05-07T20:32:31.8353680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.8354115Z def test_silu_mul_quant( 2025-05-07T20:32:31.8354357Z self, 2025-05-07T20:32:31.8354545Z T: int, 2025-05-07T20:32:31.8354733Z D: int, 2025-05-07T20:32:31.8354949Z scale_ub: Optional[float], 2025-05-07T20:32:31.8355214Z contiguous: bool, 2025-05-07T20:32:31.8355452Z compiled: bool, 2025-05-07T20:32:31.8355673Z ) -> None: 2025-05-07T20:32:31.8355883Z torch.manual_seed(2025) 2025-05-07T20:32:31.8356167Z 2025-05-07T20:32:31.8356445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.8356778Z 2025-05-07T20:32:31.8356970Z x_sign = torch.sign(x) 2025-05-07T20:32:31.8357258Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.8357558Z x = x_sign * x_clamp 2025-05-07T20:32:31.8357795Z x0 = x[:, :D] 2025-05-07T20:32:31.8358006Z x1 = x[:, D:] 2025-05-07T20:32:31.8358202Z 2025-05-07T20:32:31.8358384Z if contiguous: 2025-05-07T20:32:31.8358617Z x0 = x0.contiguous() 2025-05-07T20:32:31.8358913Z x1 = x1.contiguous() 2025-05-07T20:32:31.8359152Z 2025-05-07T20:32:31.8359385Z if scale_ub is not None: 2025-05-07T20:32:31.8359657Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.8360017Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.8360334Z ) 2025-05-07T20:32:31.8366656Z else: 2025-05-07T20:32:31.8366889Z scale_ub_tensor = None 2025-05-07T20:32:31.8367156Z 2025-05-07T20:32:31.8367404Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8367727Z op = silu_mul_quant 2025-05-07T20:32:31.8367980Z if compiled: 2025-05-07T20:32:31.8368233Z op = torch.compile(op) 2025-05-07T20:32:31.8368531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8368799Z 2025-05-07T20:32:31.8368994Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.8369161Z 2025-05-07T20:32:31.8369269Z moe/activation_test.py:117: 2025-05-07T20:32:31.8369567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8369907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.8370193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8370887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:31.8371648Z [tail of the preceding example's traceback elided; it is identical to the one shown in full below]
2025-05-07T20:32:31.8385542Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:31.8385986Z     self=,
2025-05-07T20:32:31.8386392Z     T=16384,
2025-05-07T20:32:31.8386585Z     D=7168,
2025-05-07T20:32:31.8386779Z     scale_ub=None,
2025-05-07T20:32:31.8386987Z     contiguous=True,
2025-05-07T20:32:31.8387213Z     compiled=True,
2025-05-07T20:32:31.8387417Z )
2025-05-07T20:32:32.0317919Z self = 
2025-05-07T20:32:32.0318512Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:32.0318870Z     @given(
2025-05-07T20:32:32.0319210Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:32.0319555Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:32.0319859Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:32.0320205Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:32.0320537Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:32.0320815Z     )
2025-05-07T20:32:32.0321168Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:32.0321613Z     def test_silu_mul_quant(
2025-05-07T20:32:32.0321848Z         self,
2025-05-07T20:32:32.0322047Z         T: int,
2025-05-07T20:32:32.0322252Z         D: int,
2025-05-07T20:32:32.0322601Z         scale_ub: Optional[float],
2025-05-07T20:32:32.0322875Z         contiguous: bool,
2025-05-07T20:32:32.0323120Z         compiled: bool,
2025-05-07T20:32:32.0323357Z     ) -> None:
2025-05-07T20:32:32.0323575Z         torch.manual_seed(2025)
2025-05-07T20:32:32.0324107Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:32.0324644Z         x_sign = torch.sign(x)
2025-05-07T20:32:32.0324941Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:32.0325245Z         x = x_sign * x_clamp
2025-05-07T20:32:32.0325486Z         x0 = x[:, :D]
2025-05-07T20:32:32.0325708Z         x1 = x[:, D:]
2025-05-07T20:32:32.0326105Z         if contiguous:
2025-05-07T20:32:32.0326339Z             x0 = x0.contiguous()
2025-05-07T20:32:32.0326605Z             x1 = x1.contiguous()
2025-05-07T20:32:32.0327057Z         if scale_ub is not None:
2025-05-07T20:32:32.0327330Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:32.0327655Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:32.0327964Z             )
2025-05-07T20:32:32.0328153Z         else:
2025-05-07T20:32:32.0328366Z             scale_ub_tensor = None
2025-05-07T20:32:32.0328854Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:32.0329163Z             op = silu_mul_quant
2025-05-07T20:32:32.0329413Z             if compiled:
2025-05-07T20:32:32.0329685Z                 op = torch.compile(op)
2025-05-07T20:32:32.0330072Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:32.0330545Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:32.0330815Z moe/activation_test.py:117: 
2025-05-07T20:32:32.0331128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:32.0331560Z moe/activation_test.py:115: in fn
2025-05-07T20:32:32.0331854Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:32.0332414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:32.0332974Z     return fn(*args, **kwargs)
2025-05-07T20:32:32.0333633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:32.0334316Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:32.0334923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:32.0335663Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:32.0336324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:32.0336849Z     kernel = self.compile(
2025-05-07T20:32:32.0337399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:32.0338054Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:32.0338450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:32.0338883Z self = 
2025-05-07T20:32:32.0340061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:32.0341561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1a0f760>}
2025-05-07T20:32:32.0342968Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:32.0343993Z context = 
2025-05-07T20:32:32.0344456Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:32.0344973Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:32.0345448Z                            module_map=module_map)
2025-05-07T20:32:32.0345812Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:32.0346173Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:32.0346443Z E       ^
2025-05-07T20:32:32.0346909Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.0347792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
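For context, silu_mul_quant fuses SiLU(x0) * x1 with quantization to fp8, returning the quantized tensor and a dequantization scale. The kernel source is not part of this log, so the following eager-mode reference is only a sketch of what the test appears to exercise, assuming rowwise float8_e4m3fn quantization with an optional scale upper bound; silu_mul_quant_ref and its scaling scheme are illustrative, not FBGEMM's actual implementation:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32 for accuracy, then rowwise fp8 quantization.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the quantization range
        y_scale = row_max / FP8_MAX                     # per-row dequant scale
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)   # "fp8e4nv" in Triton naming
        return y_fp8, y_scale

The fp8e4nv named in the error is Triton's name for this float8_e4m3fn target dtype; the kernel fails at compile time because Triton cannot emit that dtype on this GPU.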
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.0347368Z 2025-05-07T20:32:32.0347792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.0348317Z 2025-05-07T20:32:32.0348426Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.0348844Z self=, 2025-05-07T20:32:32.0349236Z T=4096, 2025-05-07T20:32:32.0349434Z D=5120, 2025-05-07T20:32:32.0349634Z scale_ub=None, 2025-05-07T20:32:32.0349855Z contiguous=False, 2025-05-07T20:32:32.0350075Z compiled=True, 2025-05-07T20:32:32.0350280Z ) 2025-05-07T20:32:32.0350598Z self = 2025-05-07T20:32:32.0351105Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.0351477Z 2025-05-07T20:32:32.0351562Z @given( 2025-05-07T20:32:32.0351792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.0352105Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.0352473Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.0352810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.0353138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.0353423Z ) 2025-05-07T20:32:32.0353780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.0354224Z def test_silu_mul_quant( 2025-05-07T20:32:32.0354467Z self, 2025-05-07T20:32:32.0354667Z T: int, 2025-05-07T20:32:32.0354864Z D: int, 2025-05-07T20:32:32.0355144Z scale_ub: Optional[float], 2025-05-07T20:32:32.0355414Z contiguous: bool, 2025-05-07T20:32:32.0355697Z compiled: bool, 2025-05-07T20:32:32.0355921Z ) -> None: 2025-05-07T20:32:32.0356146Z torch.manual_seed(2025) 2025-05-07T20:32:32.0356390Z 2025-05-07T20:32:32.0356665Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.0357007Z 2025-05-07T20:32:32.0357209Z x_sign = torch.sign(x) 2025-05-07T20:32:32.0357499Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.0357812Z x = x_sign * x_clamp 2025-05-07T20:32:32.0358057Z x0 = x[:, :D] 2025-05-07T20:32:32.0358274Z x1 = x[:, D:] 2025-05-07T20:32:32.0358481Z 2025-05-07T20:32:32.0358668Z if contiguous: 2025-05-07T20:32:32.0358900Z x0 = x0.contiguous() 2025-05-07T20:32:32.0359164Z x1 = x1.contiguous() 2025-05-07T20:32:32.0359406Z 2025-05-07T20:32:32.0359599Z if scale_ub is not None: 2025-05-07T20:32:32.0359897Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.0360266Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.0360578Z ) 2025-05-07T20:32:32.0360763Z else: 2025-05-07T20:32:32.0360978Z scale_ub_tensor = None 2025-05-07T20:32:32.0361230Z 2025-05-07T20:32:32.0361461Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.0361832Z op = silu_mul_quant 2025-05-07T20:32:32.0362181Z if compiled: 2025-05-07T20:32:32.0362431Z op = torch.compile(op) 2025-05-07T20:32:32.0362730Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0363004Z 2025-05-07T20:32:32.0363193Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.0363366Z 2025-05-07T20:32:32.0363470Z moe/activation_test.py:117: 2025-05-07T20:32:32.0363762Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.0364093Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.0364375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0364934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.0365494Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.0366154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.0366842Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.0367376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.0368061Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.0368721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.0369252Z kernel = self.compile( 2025-05-07T20:32:32.0369797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.0370460Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.0370851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.0371084Z 2025-05-07T20:32:32.0371297Z self = 2025-05-07T20:32:32.0372436Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.0373919Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f0280>} 2025-05-07T20:32:32.0375344Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.0376422Z context = 2025-05-07T20:32:32.0376708Z 2025-05-07T20:32:32.0376880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.0377403Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.0377868Z module_map=module_map) 2025-05-07T20:32:32.0378232Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.0378585Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.0378842Z E ^ 2025-05-07T20:32:32.0379301Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.0379750Z 2025-05-07T20:32:32.0380280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.0380797Z 2025-05-07T20:32:32.3665002Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.3665975Z self=, 2025-05-07T20:32:32.3666774Z T=4096, 2025-05-07T20:32:32.3667142Z D=5120, 2025-05-07T20:32:32.3667513Z scale_ub=1200.0, 2025-05-07T20:32:32.3667959Z contiguous=False, 2025-05-07T20:32:32.3668606Z compiled=False, 2025-05-07T20:32:32.3669003Z ) 2025-05-07T20:32:32.3669628Z self = 2025-05-07T20:32:32.3670252Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.3670528Z 2025-05-07T20:32:32.3670614Z @given( 2025-05-07T20:32:32.3670843Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.3671161Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.3671470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.3671794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.3672132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.3672415Z ) 2025-05-07T20:32:32.3672764Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.3673210Z def test_silu_mul_quant( 2025-05-07T20:32:32.3673461Z self, 2025-05-07T20:32:32.3673658Z T: int, 2025-05-07T20:32:32.3673853Z D: int, 2025-05-07T20:32:32.3674073Z scale_ub: Optional[float], 2025-05-07T20:32:32.3674347Z contiguous: bool, 2025-05-07T20:32:32.3674587Z compiled: bool, 2025-05-07T20:32:32.3674809Z ) -> None: 2025-05-07T20:32:32.3675029Z torch.manual_seed(2025) 2025-05-07T20:32:32.3675265Z 2025-05-07T20:32:32.3675543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.3675897Z 2025-05-07T20:32:32.3676093Z x_sign = torch.sign(x) 2025-05-07T20:32:32.3676389Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.3676702Z x = x_sign * x_clamp 2025-05-07T20:32:32.3676941Z x0 = x[:, :D] 2025-05-07T20:32:32.3677154Z x1 = x[:, D:] 2025-05-07T20:32:32.3677357Z 2025-05-07T20:32:32.3677542Z if contiguous: 2025-05-07T20:32:32.3677776Z x0 = x0.contiguous() 2025-05-07T20:32:32.3678116Z x1 = x1.contiguous() 2025-05-07T20:32:32.3678352Z 2025-05-07T20:32:32.3678542Z if scale_ub is not None: 2025-05-07T20:32:32.3678817Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.3679154Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.3679455Z ) 2025-05-07T20:32:32.3679647Z else: 2025-05-07T20:32:32.3679861Z scale_ub_tensor = None 2025-05-07T20:32:32.3680101Z 2025-05-07T20:32:32.3680327Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.3680713Z op = silu_mul_quant 2025-05-07T20:32:32.3680957Z if compiled: 2025-05-07T20:32:32.3681264Z op = torch.compile(op) 2025-05-07T20:32:32.3681564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.3681830Z 2025-05-07T20:32:32.3682020Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.3682182Z 2025-05-07T20:32:32.3682287Z moe/activation_test.py:117: 2025-05-07T20:32:32.3682580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.3682907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.3683187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.3683872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:32.3684550Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.3685088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.3685768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.3686424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.3686950Z kernel = self.compile( 2025-05-07T20:32:32.3687482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.3688191Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.3688579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.3688807Z 2025-05-07T20:32:32.3689016Z self = 2025-05-07T20:32:32.3690434Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.3691808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f1000>} 2025-05-07T20:32:32.3693151Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.3694155Z context = 2025-05-07T20:32:32.3694440Z 2025-05-07T20:32:32.3694602Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.3695113Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.3695572Z module_map=module_map) 2025-05-07T20:32:32.3695932Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.3696283Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.3696537Z E ^ 2025-05-07T20:32:32.3696991Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.3697439Z 2025-05-07T20:32:32.3697849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.3698451Z 2025-05-07T20:32:32.3698558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.3698964Z self=, 2025-05-07T20:32:32.3699353Z T=4096, 2025-05-07T20:32:32.3699533Z D=5120, 2025-05-07T20:32:32.3699718Z scale_ub=1200.0, 2025-05-07T20:32:32.3700010Z contiguous=False, 2025-05-07T20:32:32.3700254Z compiled=True, 2025-05-07T20:32:32.3700475Z ) 2025-05-07T20:32:32.3700780Z self = 2025-05-07T20:32:32.3701350Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.3701617Z 2025-05-07T20:32:32.3701758Z @given( 2025-05-07T20:32:32.3701985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.3702297Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.3702603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.3702958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.3703291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.3703566Z ) 2025-05-07T20:32:32.3703918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.3704350Z def test_silu_mul_quant( 2025-05-07T20:32:32.3704580Z self, 2025-05-07T20:32:32.3704771Z T: int, 2025-05-07T20:32:32.3704965Z D: int, 2025-05-07T20:32:32.3705176Z scale_ub: Optional[float], 2025-05-07T20:32:32.3705445Z contiguous: bool, 2025-05-07T20:32:32.3705687Z compiled: bool, 2025-05-07T20:32:32.3705901Z ) -> None: 2025-05-07T20:32:32.3706118Z torch.manual_seed(2025) 2025-05-07T20:32:32.3706362Z 2025-05-07T20:32:32.3706627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.3706964Z 2025-05-07T20:32:32.3707158Z x_sign = torch.sign(x) 2025-05-07T20:32:32.3707440Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.3707815Z x = x_sign * x_clamp 2025-05-07T20:32:32.3708058Z x0 = x[:, :D] 2025-05-07T20:32:32.3708273Z x1 = x[:, D:] 2025-05-07T20:32:32.3708469Z 2025-05-07T20:32:32.3708650Z if contiguous: 2025-05-07T20:32:32.3708880Z x0 = x0.contiguous() 2025-05-07T20:32:32.3709125Z x1 = x1.contiguous() 2025-05-07T20:32:32.3709357Z 2025-05-07T20:32:32.3709543Z if scale_ub is not None: 2025-05-07T20:32:32.3709810Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.3710148Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.3710455Z ) 2025-05-07T20:32:32.3710645Z else: 2025-05-07T20:32:32.3710853Z scale_ub_tensor = None 2025-05-07T20:32:32.3711095Z 2025-05-07T20:32:32.3711315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.3711622Z op = silu_mul_quant 2025-05-07T20:32:32.3711873Z if compiled: 2025-05-07T20:32:32.3712116Z op = torch.compile(op) 2025-05-07T20:32:32.3712404Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.3712670Z 2025-05-07T20:32:32.3712860Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.3713023Z 2025-05-07T20:32:32.3713121Z moe/activation_test.py:117: 2025-05-07T20:32:32.3713418Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.3713744Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.3714014Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.3714572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.3715127Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.3715775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.3716463Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.3717047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.3717717Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.3718370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.3718894Z kernel = self.compile( 2025-05-07T20:32:32.3719425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.3720120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.3720552Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.3720783Z 2025-05-07T20:32:32.3720989Z self = 2025-05-07T20:32:32.3722055Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.3723412Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f0700>} 2025-05-07T20:32:32.3724731Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.3725750Z context = 2025-05-07T20:32:32.3726042Z 2025-05-07T20:32:32.3726206Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.3726719Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.3727187Z module_map=module_map) 2025-05-07T20:32:32.3727591Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.3727948Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.3728201Z E ^ 2025-05-07T20:32:32.3728660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.3729104Z 2025-05-07T20:32:32.3729526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.3730046Z 2025-05-07T20:32:32.5023997Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.5024528Z self=, 2025-05-07T20:32:32.5024953Z T=2048, 2025-05-07T20:32:32.5025146Z D=7168, 2025-05-07T20:32:32.5025341Z scale_ub=1200.0, 2025-05-07T20:32:32.5025573Z contiguous=False, 2025-05-07T20:32:32.5025797Z compiled=False, 2025-05-07T20:32:32.5026005Z ) 2025-05-07T20:32:32.5026327Z self = 2025-05-07T20:32:32.5026825Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.5027104Z 2025-05-07T20:32:32.5027183Z @given( 2025-05-07T20:32:32.5027417Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.5027721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.5028033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.5028365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.5028700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.5028978Z ) 2025-05-07T20:32:32.5029334Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.5029775Z def test_silu_mul_quant( 2025-05-07T20:32:32.5030014Z self, 2025-05-07T20:32:32.5030211Z T: int, 2025-05-07T20:32:32.5030415Z D: int, 2025-05-07T20:32:32.5030752Z scale_ub: Optional[float], 2025-05-07T20:32:32.5031032Z contiguous: bool, 2025-05-07T20:32:32.5031272Z compiled: bool, 2025-05-07T20:32:32.5031499Z ) -> None: 2025-05-07T20:32:32.5031725Z torch.manual_seed(2025) 2025-05-07T20:32:32.5031975Z 2025-05-07T20:32:32.5032247Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.5032592Z 2025-05-07T20:32:32.5032791Z x_sign = torch.sign(x) 2025-05-07T20:32:32.5033079Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.5033480Z x = x_sign * x_clamp 2025-05-07T20:32:32.5033724Z x0 = x[:, :D] 2025-05-07T20:32:32.5033944Z x1 = x[:, D:] 2025-05-07T20:32:32.5034207Z 2025-05-07T20:32:32.5034400Z if contiguous: 2025-05-07T20:32:32.5034634Z x0 = x0.contiguous() 2025-05-07T20:32:32.5034888Z x1 = x1.contiguous() 2025-05-07T20:32:32.5035124Z 2025-05-07T20:32:32.5035324Z if scale_ub is not None: 2025-05-07T20:32:32.5035597Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.5035940Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.5036258Z ) 2025-05-07T20:32:32.5036447Z else: 2025-05-07T20:32:32.5036663Z scale_ub_tensor = None 2025-05-07T20:32:32.5036910Z 2025-05-07T20:32:32.5037171Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.5037490Z op = silu_mul_quant 2025-05-07T20:32:32.5037734Z if compiled: 2025-05-07T20:32:32.5037989Z op = torch.compile(op) 2025-05-07T20:32:32.5038284Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5038554Z 2025-05-07T20:32:32.5038756Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.5038920Z 2025-05-07T20:32:32.5039024Z moe/activation_test.py:117: 2025-05-07T20:32:32.5039311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5039642Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.5039992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5040688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:32.5041366Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.5041906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.5042582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.5043242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.5043770Z kernel = self.compile( 2025-05-07T20:32:32.5044316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.5044959Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.5045358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5045585Z 2025-05-07T20:32:32.5045790Z self = 2025-05-07T20:32:32.5046856Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.5048227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f1240>} 2025-05-07T20:32:32.5049557Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.5050576Z context = 2025-05-07T20:32:32.5050913Z 2025-05-07T20:32:32.5051080Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.5051604Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.5052062Z module_map=module_map) 2025-05-07T20:32:32.5052429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.5052779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.5053029Z E ^ 2025-05-07T20:32:32.5053534Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.5054017Z 2025-05-07T20:32:32.5054435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.5054947Z 2025-05-07T20:32:32.5055057Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.5055467Z self=, 2025-05-07T20:32:32.5055867Z T=1, 2025-05-07T20:32:32.5056048Z D=7168, 2025-05-07T20:32:32.5056238Z scale_ub=None, 2025-05-07T20:32:32.5056454Z contiguous=True, 2025-05-07T20:32:32.5056676Z compiled=False, 2025-05-07T20:32:32.5056876Z ) 2025-05-07T20:32:32.5057192Z self = 2025-05-07T20:32:32.5057676Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.5057930Z 2025-05-07T20:32:32.5058018Z @given( 2025-05-07T20:32:32.5058244Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.5058559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.5058868Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.5059189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.5059514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.5059871Z ) 2025-05-07T20:32:32.5060265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.5060703Z def test_silu_mul_quant( 2025-05-07T20:32:32.5060945Z self, 2025-05-07T20:32:32.5061132Z T: int, 2025-05-07T20:32:32.5061330Z D: int, 2025-05-07T20:32:32.5061552Z scale_ub: Optional[float], 2025-05-07T20:32:32.5061822Z contiguous: bool, 2025-05-07T20:32:32.5062054Z compiled: bool, 2025-05-07T20:32:32.5062278Z ) -> None: 2025-05-07T20:32:32.5062497Z torch.manual_seed(2025) 2025-05-07T20:32:32.5062733Z 2025-05-07T20:32:32.5063005Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.5063345Z 2025-05-07T20:32:32.5063532Z x_sign = torch.sign(x) 2025-05-07T20:32:32.5063820Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.5064124Z x = x_sign * x_clamp 2025-05-07T20:32:32.5064357Z x0 = x[:, :D] 2025-05-07T20:32:32.5064585Z x1 = x[:, D:] 2025-05-07T20:32:32.5064797Z 2025-05-07T20:32:32.5064981Z if contiguous: 2025-05-07T20:32:32.5065216Z x0 = x0.contiguous() 2025-05-07T20:32:32.5065476Z x1 = x1.contiguous() 2025-05-07T20:32:32.5065707Z 2025-05-07T20:32:32.5065901Z if scale_ub is not None: 2025-05-07T20:32:32.5066173Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.5066502Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.5066803Z ) 2025-05-07T20:32:32.5066996Z else: 2025-05-07T20:32:32.5067209Z scale_ub_tensor = None 2025-05-07T20:32:32.5067452Z 2025-05-07T20:32:32.5067684Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.5067995Z op = silu_mul_quant 2025-05-07T20:32:32.5068238Z if compiled: 2025-05-07T20:32:32.5068489Z op = torch.compile(op) 2025-05-07T20:32:32.5068784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5069105Z 2025-05-07T20:32:32.5069303Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.5069466Z 2025-05-07T20:32:32.5069573Z moe/activation_test.py:117: 2025-05-07T20:32:32.5069868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5070196Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.5070479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5071160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.5071895Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.5072530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.5073220Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.5073872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.5074403Z kernel = self.compile( 2025-05-07T20:32:32.5074941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.5075589Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.5075977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5076208Z 2025-05-07T20:32:32.5076414Z self = 2025-05-07T20:32:32.5077490Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.5078905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f2050>} 2025-05-07T20:32:32.5080300Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.5081331Z context = 2025-05-07T20:32:32.5081619Z 2025-05-07T20:32:32.5081782Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.5082295Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.5082756Z module_map=module_map) 2025-05-07T20:32:32.5083122Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.5083473Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.5083729Z E ^ 2025-05-07T20:32:32.5084181Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.5084634Z 2025-05-07T20:32:32.5085045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.5085551Z 2025-05-07T20:32:32.5085662Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.5086066Z self=, 2025-05-07T20:32:32.5086459Z T=16384, 2025-05-07T20:32:32.5086653Z D=7168, 2025-05-07T20:32:32.5086848Z scale_ub=1200.0, 2025-05-07T20:32:32.5087071Z contiguous=False, 2025-05-07T20:32:32.5087295Z compiled=True, 2025-05-07T20:32:32.7715879Z ) 2025-05-07T20:32:32.7716247Z self = 2025-05-07T20:32:32.7716823Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.7717119Z 2025-05-07T20:32:32.7717199Z @given( 2025-05-07T20:32:32.7717440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7717868Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7718181Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7718519Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7718852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7719134Z ) 2025-05-07T20:32:32.7719491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7719928Z def test_silu_mul_quant( 2025-05-07T20:32:32.7720166Z self, 2025-05-07T20:32:32.7720436Z T: int, 2025-05-07T20:32:32.7720635Z D: int, 2025-05-07T20:32:32.7720852Z scale_ub: Optional[float], 2025-05-07T20:32:32.7721210Z contiguous: bool, 2025-05-07T20:32:32.7721458Z compiled: bool, 2025-05-07T20:32:32.7721684Z ) -> None: 2025-05-07T20:32:32.7721905Z torch.manual_seed(2025) 2025-05-07T20:32:32.7722146Z 2025-05-07T20:32:32.7722418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7722761Z 2025-05-07T20:32:32.7722956Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7723258Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7723564Z x = x_sign * x_clamp 2025-05-07T20:32:32.7723810Z x0 = x[:, :D] 2025-05-07T20:32:32.7724032Z x1 = x[:, D:] 2025-05-07T20:32:32.7724234Z 2025-05-07T20:32:32.7724424Z if contiguous: 2025-05-07T20:32:32.7724661Z x0 = x0.contiguous() 2025-05-07T20:32:32.7724922Z x1 = x1.contiguous() 2025-05-07T20:32:32.7725166Z 2025-05-07T20:32:32.7725366Z if scale_ub is not None: 2025-05-07T20:32:32.7725641Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7725975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7726284Z ) 2025-05-07T20:32:32.7726468Z else: 2025-05-07T20:32:32.7726685Z scale_ub_tensor = None 2025-05-07T20:32:32.7726941Z 2025-05-07T20:32:32.7727231Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7727550Z op = silu_mul_quant 2025-05-07T20:32:32.7727806Z if compiled: 2025-05-07T20:32:32.7728058Z op = torch.compile(op) 2025-05-07T20:32:32.7728358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7728634Z 2025-05-07T20:32:32.7728828Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7728995Z 2025-05-07T20:32:32.7729097Z moe/activation_test.py:117: 2025-05-07T20:32:32.7729400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7729732Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7730017Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7730578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.7731138Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.7731799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7732476Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7733013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7733694Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7734353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7734881Z kernel = self.compile( 2025-05-07T20:32:32.7735428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7736074Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7736463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7736749Z 2025-05-07T20:32:32.7736965Z self = 2025-05-07T20:32:32.7738052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7739430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f3490>} 2025-05-07T20:32:32.7740980Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7742011Z context = 2025-05-07T20:32:32.7742299Z 2025-05-07T20:32:32.7742463Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7742979Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7743435Z module_map=module_map) 2025-05-07T20:32:32.7743802Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7744160Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7744420Z E ^ 2025-05-07T20:32:32.7744875Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7745333Z 2025-05-07T20:32:32.7745747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7746252Z 2025-05-07T20:32:32.7746364Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7746772Z self=, 2025-05-07T20:32:32.7747161Z T=1, 2025-05-07T20:32:32.7747345Z D=7168, 2025-05-07T20:32:32.7747582Z scale_ub=None, 2025-05-07T20:32:32.7747797Z contiguous=False, 2025-05-07T20:32:32.7748022Z compiled=False, 2025-05-07T20:32:32.7748220Z ) 2025-05-07T20:32:32.7748529Z self = 2025-05-07T20:32:32.7749009Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.7749265Z 2025-05-07T20:32:32.7749345Z @given( 2025-05-07T20:32:32.7749568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7749879Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7750208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7750555Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7750877Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7751159Z ) 2025-05-07T20:32:32.7751507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7751947Z def test_silu_mul_quant( 2025-05-07T20:32:32.7752186Z self, 2025-05-07T20:32:32.7752380Z T: int, 2025-05-07T20:32:32.7752572Z D: int, 2025-05-07T20:32:32.7752790Z scale_ub: Optional[float], 2025-05-07T20:32:32.7753063Z contiguous: bool, 2025-05-07T20:32:32.7753297Z compiled: bool, 2025-05-07T20:32:32.7753519Z ) -> None: 2025-05-07T20:32:32.7753733Z torch.manual_seed(2025) 2025-05-07T20:32:32.7753965Z 2025-05-07T20:32:32.7754232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7754565Z 2025-05-07T20:32:32.7754749Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7755040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7755343Z x = x_sign * x_clamp 2025-05-07T20:32:32.7755583Z x0 = x[:, :D] 2025-05-07T20:32:32.7755810Z x1 = x[:, D:] 2025-05-07T20:32:32.7756014Z 2025-05-07T20:32:32.7756198Z if contiguous: 2025-05-07T20:32:32.7756481Z x0 = x0.contiguous() 2025-05-07T20:32:32.7756733Z x1 = x1.contiguous() 2025-05-07T20:32:32.7756970Z 2025-05-07T20:32:32.7757154Z if scale_ub is not None: 2025-05-07T20:32:32.7757417Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7757744Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7758041Z ) 2025-05-07T20:32:32.7758227Z else: 2025-05-07T20:32:32.7758435Z scale_ub_tensor = None 2025-05-07T20:32:32.7758688Z 2025-05-07T20:32:32.7758962Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7759272Z op = silu_mul_quant 2025-05-07T20:32:32.7759561Z if compiled: 2025-05-07T20:32:32.7759807Z op = torch.compile(op) 2025-05-07T20:32:32.7760122Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7760419Z 2025-05-07T20:32:32.7760607Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7760780Z 2025-05-07T20:32:32.7760879Z moe/activation_test.py:117: 2025-05-07T20:32:32.7761177Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7761502Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7761779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7762462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7763144Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7763678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7764355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7765016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7765533Z kernel = self.compile( 2025-05-07T20:32:32.7766119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7766777Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7767164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7767387Z 2025-05-07T20:32:32.7767594Z self = 2025-05-07T20:32:32.7768658Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7770017Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f37f0>} 2025-05-07T20:32:32.7771349Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7772359Z context = 2025-05-07T20:32:32.7772649Z 2025-05-07T20:32:32.7772812Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7773326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7773785Z module_map=module_map) 2025-05-07T20:32:32.7774147Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7774492Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7774747Z E ^ 2025-05-07T20:32:32.7775206Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7775645Z 2025-05-07T20:32:32.7776054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7776611Z 2025-05-07T20:32:32.7776714Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7777120Z self=, 2025-05-07T20:32:32.7777511Z T=2048, 2025-05-07T20:32:32.7777688Z D=7168, 2025-05-07T20:32:32.7777876Z scale_ub=None, 2025-05-07T20:32:32.7778087Z contiguous=False, 2025-05-07T20:32:32.7778304Z compiled=True, 2025-05-07T20:32:32.7778500Z ) 2025-05-07T20:32:32.8779002Z self = 2025-05-07T20:32:32.8779674Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8780035Z 2025-05-07T20:32:32.8780114Z @given( 2025-05-07T20:32:32.8780352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8780664Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8780960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8781301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8781634Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8781921Z ) 2025-05-07T20:32:32.8782269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8782711Z def test_silu_mul_quant( 2025-05-07T20:32:32.8782953Z self, 2025-05-07T20:32:32.8783151Z T: int, 2025-05-07T20:32:32.8783354Z D: int, 2025-05-07T20:32:32.8783584Z scale_ub: Optional[float], 2025-05-07T20:32:32.8783858Z contiguous: bool, 2025-05-07T20:32:32.8784099Z compiled: bool, 2025-05-07T20:32:32.8784332Z ) -> None: 2025-05-07T20:32:32.8784544Z torch.manual_seed(2025) 2025-05-07T20:32:32.8784789Z 2025-05-07T20:32:32.8785065Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8785400Z 2025-05-07T20:32:32.8785602Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8785960Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8786288Z x = x_sign * x_clamp 2025-05-07T20:32:32.8786522Z x0 = x[:, :D] 2025-05-07T20:32:32.8786748Z x1 = x[:, D:] 2025-05-07T20:32:32.8786956Z 2025-05-07T20:32:32.8787132Z if contiguous: 2025-05-07T20:32:32.8787369Z x0 = x0.contiguous() 2025-05-07T20:32:32.8787634Z x1 = x1.contiguous() 2025-05-07T20:32:32.8787874Z 2025-05-07T20:32:32.8794499Z if scale_ub is not None: 2025-05-07T20:32:32.8794806Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8795144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8795463Z ) 2025-05-07T20:32:32.8795663Z else: 2025-05-07T20:32:32.8795869Z scale_ub_tensor = None 2025-05-07T20:32:32.8796119Z 2025-05-07T20:32:32.8796357Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8796671Z op = silu_mul_quant 2025-05-07T20:32:32.8796926Z if compiled: 2025-05-07T20:32:32.8797177Z op = torch.compile(op) 2025-05-07T20:32:32.8797468Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8797744Z 2025-05-07T20:32:32.8797942Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8798106Z 2025-05-07T20:32:32.8798215Z moe/activation_test.py:117: 2025-05-07T20:32:32.8798512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8798843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8799125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8799677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8800240Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8800896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8801691Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8802221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8802895Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8803554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8804076Z kernel = self.compile( 2025-05-07T20:32:32.8804612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8805393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8805790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8806014Z 2025-05-07T20:32:32.8806220Z self = 2025-05-07T20:32:32.8807291Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8808655Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1b50af0>} 2025-05-07T20:32:32.8809988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8811068Z context = 2025-05-07T20:32:32.8811360Z 2025-05-07T20:32:32.8811525Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8812039Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8812563Z module_map=module_map) 2025-05-07T20:32:32.8812922Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8813275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8813532Z E ^ 2025-05-07T20:32:32.8813984Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8814428Z 2025-05-07T20:32:32.8814840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8815354Z 2025-05-07T20:32:32.8815456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8815866Z self=, 2025-05-07T20:32:32.8816273Z T=4096, 2025-05-07T20:32:32.8816461Z D=7168, 2025-05-07T20:32:32.8816653Z scale_ub=None, 2025-05-07T20:32:32.8816861Z contiguous=False, 2025-05-07T20:32:32.8817091Z compiled=True, 2025-05-07T20:32:32.8817292Z ) 2025-05-07T20:32:32.8817601Z self = 2025-05-07T20:32:32.8818088Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8818357Z 2025-05-07T20:32:32.8818433Z @given( 2025-05-07T20:32:32.8818663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8818969Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8819279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8819607Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8820092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8820487Z ) 2025-05-07T20:32:32.8820906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8821387Z def test_silu_mul_quant( 2025-05-07T20:32:32.8821634Z self, 2025-05-07T20:32:32.8821897Z T: int, 2025-05-07T20:32:32.8822105Z D: int, 2025-05-07T20:32:32.8822322Z scale_ub: Optional[float], 2025-05-07T20:32:32.8822594Z contiguous: bool, 2025-05-07T20:32:32.8822833Z compiled: bool, 2025-05-07T20:32:32.8823057Z ) -> None: 2025-05-07T20:32:32.8823265Z torch.manual_seed(2025) 2025-05-07T20:32:32.8823507Z 2025-05-07T20:32:32.8823779Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8824115Z 2025-05-07T20:32:32.8824311Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8824652Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8824956Z x = x_sign * x_clamp 2025-05-07T20:32:32.8825240Z x0 = x[:, :D] 2025-05-07T20:32:32.8825463Z x1 = x[:, D:] 2025-05-07T20:32:32.8825664Z 2025-05-07T20:32:32.8825850Z if contiguous: 2025-05-07T20:32:32.8826082Z x0 = x0.contiguous() 2025-05-07T20:32:32.8826336Z x1 = x1.contiguous() 2025-05-07T20:32:32.8826575Z 2025-05-07T20:32:32.8826772Z if scale_ub is not None: 2025-05-07T20:32:32.8827039Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8827372Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8827677Z ) 2025-05-07T20:32:32.8827866Z else: 2025-05-07T20:32:32.8828079Z scale_ub_tensor = None 2025-05-07T20:32:32.8828329Z 2025-05-07T20:32:32.8828562Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8828868Z op = silu_mul_quant 2025-05-07T20:32:32.8829123Z if compiled: 2025-05-07T20:32:32.8829377Z op = torch.compile(op) 2025-05-07T20:32:32.8829670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8829938Z 2025-05-07T20:32:32.8830134Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8830296Z 2025-05-07T20:32:32.8830397Z moe/activation_test.py:117: 2025-05-07T20:32:32.8830740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8831076Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8831352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8831903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8832457Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8833113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8833794Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8834331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8835003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8835662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8836196Z kernel = self.compile( 2025-05-07T20:32:32.8836739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8837387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8837778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8838007Z 2025-05-07T20:32:32.8838214Z self = 2025-05-07T20:32:32.8839281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8840642Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1b50280>} 2025-05-07T20:32:32.8842020Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8843048Z context = 2025-05-07T20:32:32.8843340Z 2025-05-07T20:32:32.8843505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8844028Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8844533Z module_map=module_map) 2025-05-07T20:32:32.8844895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8845285Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8845551Z E ^ 2025-05-07T20:32:32.8846010Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8846459Z 2025-05-07T20:32:32.8846884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8847403Z 2025-05-07T20:32:33.2324099Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.2324563Z self=, 2025-05-07T20:32:33.2324999Z T=16384, 2025-05-07T20:32:33.2325202Z D=5120, 2025-05-07T20:32:33.2325398Z scale_ub=1200.0, 2025-05-07T20:32:33.2325627Z contiguous=False, 2025-05-07T20:32:33.2325854Z compiled=False, 2025-05-07T20:32:33.2326075Z ) 2025-05-07T20:32:33.2326397Z self = 2025-05-07T20:32:33.2326901Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.2327218Z 2025-05-07T20:32:33.2327299Z @given( 2025-05-07T20:32:33.2327527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.2327844Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.2328270Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.2328600Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.2328932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.2329216Z ) 2025-05-07T20:32:33.2329564Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.2330011Z def test_silu_mul_quant( 2025-05-07T20:32:33.2330261Z self, 2025-05-07T20:32:33.2330450Z T: int, 2025-05-07T20:32:33.2330654Z D: int, 2025-05-07T20:32:33.2330880Z scale_ub: Optional[float], 2025-05-07T20:32:33.2331156Z contiguous: bool, 2025-05-07T20:32:33.2331403Z compiled: bool, 2025-05-07T20:32:33.2331639Z ) -> None: 2025-05-07T20:32:33.2331853Z torch.manual_seed(2025) 2025-05-07T20:32:33.2332089Z 2025-05-07T20:32:33.2332364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.2332706Z 2025-05-07T20:32:33.2332900Z x_sign = torch.sign(x) 2025-05-07T20:32:33.2333192Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.2333499Z x = x_sign * x_clamp 2025-05-07T20:32:33.2333737Z x0 = x[:, :D] 2025-05-07T20:32:33.2333955Z x1 = x[:, D:] 2025-05-07T20:32:33.2334162Z 2025-05-07T20:32:33.2334340Z if contiguous: 2025-05-07T20:32:33.2334576Z x0 = x0.contiguous() 2025-05-07T20:32:33.2334836Z x1 = x1.contiguous() 2025-05-07T20:32:33.2335076Z 2025-05-07T20:32:33.2335266Z if scale_ub is not None: 2025-05-07T20:32:33.2335542Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.2335874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.2336180Z ) 2025-05-07T20:32:33.2336379Z else: 2025-05-07T20:32:33.2336592Z scale_ub_tensor = None 2025-05-07T20:32:33.2336839Z 2025-05-07T20:32:33.2337150Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.2337463Z op = silu_mul_quant 2025-05-07T20:32:33.2337719Z if compiled: 2025-05-07T20:32:33.2337970Z op = torch.compile(op) 2025-05-07T20:32:33.2338262Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2338532Z 2025-05-07T20:32:33.2338730Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.2338897Z 2025-05-07T20:32:33.2339005Z moe/activation_test.py:117: 2025-05-07T20:32:33.2339297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2339700Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.2340046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2340816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:33.2341497Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.2342037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.2342716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.2343367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.2343892Z kernel = self.compile( 2025-05-07T20:32:33.2344429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.2345077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.2345477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2345703Z 2025-05-07T20:32:33.2345906Z self = 2025-05-07T20:32:33.2347016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.2348382Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1b52d40>} 2025-05-07T20:32:33.2349704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.2350723Z context = 2025-05-07T20:32:33.2351006Z 2025-05-07T20:32:33.2351179Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.2351695Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.2352154Z module_map=module_map) 2025-05-07T20:32:33.2352521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.2352869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.2353116Z E ^ 2025-05-07T20:32:33.2353574Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

The next eleven Hypothesis examples all failed at kernel compilation with the identical error:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
E   The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Only the sampled parameters differ; they are listed below, after a note on the root cause.
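All of these failures are one compile-time rejection, raised while Triton lowers _fbgemm_silu_mul_quant from Python AST to TTIR, before anything executes on the device: the kernel materializes an fp8e4nv (float8 e4m3) value, and Triton's CUDA backend accepts fp8e4nv only on GPUs with compute capability 8.9 or newer. The tuple in the message, ('fp8e4b15', 'fp8e5'), is what Triton advertises on older architectures, so the GPU behind this job is pre-sm_89. A test in this position can skip instead of erroring out; a minimal sketch of such a guard, assuming skipping is the desired behavior (the helper and class names are illustrative, not taken from the test file):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton compiles fp8e4nv (e4m3) kernels only for compute capability
        # (8, 9) and above, i.e. Ada (sm_89) and newer; older parts raise the
        # CompilationError captured in this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch: guard the whole test class so Hypothesis never drives the kernel.
    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class SiluMulQuantTests(unittest.TestCase):
        ...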
Parameters of the next ten failing examples (each one raised the CompilationError above):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
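For reference, the check fires inside make_ir (the AST-to-TTIR step visible in every traceback above), so a trivial kernel that merely casts to the dtype reproduces the failure with no FBGEMM code involved. A minimal sketch, assuming a recent Triton on a pre-sm_89 CUDA GPU; the kernel is illustrative and is expected to fail at compile time with the same ValueError:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On pre-sm_89 GPUs the cast below is rejected during lowering with
        # ValueError("type fp8e4nv not supported in this architecture. ...").
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    cast_to_fp8e4nv[(triton.cdiv(x.numel(), 256),)](x, y, x.numel(), BLOCK=256)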
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.3726939Z 2025-05-07T20:32:34.3727354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.3727867Z 2025-05-07T20:32:34.3727974Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.3728428Z self=, 2025-05-07T20:32:34.3728824Z T=128, 2025-05-07T20:32:34.3729049Z D=7168, 2025-05-07T20:32:34.3729239Z scale_ub=1200.0, 2025-05-07T20:32:34.3729469Z contiguous=False, 2025-05-07T20:32:34.3729695Z compiled=True, 2025-05-07T20:32:34.3729898Z ) 2025-05-07T20:32:34.4767869Z self = 2025-05-07T20:32:34.4768455Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:34.4768725Z 2025-05-07T20:32:34.4768814Z @given( 2025-05-07T20:32:34.4769045Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4769363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4769674Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4770010Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4770339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4770638Z ) 2025-05-07T20:32:34.4770998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4771437Z def test_silu_mul_quant( 2025-05-07T20:32:34.4771684Z self, 2025-05-07T20:32:34.4771888Z T: int, 2025-05-07T20:32:34.4772084Z D: int, 2025-05-07T20:32:34.4772311Z scale_ub: Optional[float], 2025-05-07T20:32:34.4772590Z contiguous: bool, 2025-05-07T20:32:34.4772953Z compiled: bool, 2025-05-07T20:32:34.4773186Z ) -> None: 2025-05-07T20:32:34.4773407Z torch.manual_seed(2025) 2025-05-07T20:32:34.4773652Z 2025-05-07T20:32:34.4773942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4774286Z 2025-05-07T20:32:34.4774483Z x_sign = torch.sign(x) 2025-05-07T20:32:34.4774771Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.4775081Z x = x_sign * x_clamp 2025-05-07T20:32:34.4775321Z x0 = x[:, :D] 2025-05-07T20:32:34.4775533Z x1 = x[:, D:] 2025-05-07T20:32:34.4775743Z 2025-05-07T20:32:34.4775935Z if contiguous: 2025-05-07T20:32:34.4776169Z x0 = x0.contiguous() 2025-05-07T20:32:34.4776433Z x1 = x1.contiguous() 2025-05-07T20:32:34.4776675Z 2025-05-07T20:32:34.4776862Z if scale_ub is not None: 2025-05-07T20:32:34.4777138Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.4777491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.4777793Z ) 2025-05-07T20:32:34.4777982Z else: 2025-05-07T20:32:34.4778194Z scale_ub_tensor = None 2025-05-07T20:32:34.4778433Z 2025-05-07T20:32:34.4778671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.4778983Z op = silu_mul_quant 2025-05-07T20:32:34.4779232Z if compiled: 2025-05-07T20:32:34.4779487Z op = torch.compile(op) 2025-05-07T20:32:34.4779852Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.4780134Z 2025-05-07T20:32:34.4780324Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.4780496Z 2025-05-07T20:32:34.4780600Z moe/activation_test.py:117: 2025-05-07T20:32:34.4780899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.4781229Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.4781512Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.4782150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.4782705Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True,
) -> fails with the same CompilationError (fp8e4nv not supported), via an identical traceback through silu_mul_quant and the Triton compiler.
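Every CompilationError in this run is the same architecture mismatch: Triton's fp8e4nv is the float8_e4m3fn format, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper), while the A10G in this g5.4xlarge runner reports compute capability 8.6 and hence offers only fp8e4b15 and fp8e5. A minimal sketch of a guard that would skip these examples on unsupported hardware follows; the helper name and decorator placement are illustrative, not taken from the test suite.

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs an NVIDIA GPU with compute
    # capability >= 8.9; the A10G on this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Illustrative usage on the failing test:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(...): ...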
Trying example: test_silu_mul_quant(
    T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False,
)
[test body as listed above]
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
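The OutOfMemoryError examples share a pattern: GPU 0 always reports 22.07 GiB total, but the free amount shrinks from about 140 MiB here to about 26 MiB by the end of the run, until even 40 MiB allocations fail. That suggests allocations accumulating across Hypothesis examples rather than any single example being too large. Beyond the allocator's own suggestion of PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, one plausible mitigation is a per-example cleanup hook; the sketch below assumes the test does not intentionally cache tensors between examples. The remaining tried examples are summarized after the sketch.

import gc
import torch

def release_cuda_memory() -> None:
    # Drop dead Python references, then return the allocator's cached
    # blocks to the driver so the next Hypothesis example starts clean.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

# Illustrative usage: call from setUp/tearDown (or a per-example
# Hypothesis hook) in moe/activation_test.py.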
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 112.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 448.00 MiB)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=None,   contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 320.00 MiB)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 80.00 MiB)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 112.00 MiB)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 112.00 MiB)

Trying example: test_silu_mul_quant(
    T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True,
)
[test body as listed above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3010194Z 2025-05-07T20:32:35.3010318Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3010531Z 2025-05-07T20:32:35.3010638Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3011054Z self=, 2025-05-07T20:32:35.3011479Z T=4096, 2025-05-07T20:32:35.3011665Z D=7168, 2025-05-07T20:32:35.3011855Z scale_ub=None, 2025-05-07T20:32:35.3012064Z contiguous=True, 2025-05-07T20:32:35.3012282Z compiled=False, 2025-05-07T20:32:35.3012489Z ) 2025-05-07T20:32:35.3012815Z self = 2025-05-07T20:32:35.3013307Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.3013570Z 2025-05-07T20:32:35.3013644Z @given( 2025-05-07T20:32:35.3013868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3014184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3014543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3015008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3015346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3015628Z ) 2025-05-07T20:32:35.3015981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3016425Z def test_silu_mul_quant( 2025-05-07T20:32:35.3016665Z self, 2025-05-07T20:32:35.3016859Z T: int, 2025-05-07T20:32:35.3017051Z D: int, 2025-05-07T20:32:35.3017320Z scale_ub: Optional[float], 2025-05-07T20:32:35.3017587Z contiguous: bool, 2025-05-07T20:32:35.3017830Z compiled: bool, 2025-05-07T20:32:35.3018089Z ) -> None: 2025-05-07T20:32:35.3018302Z torch.manual_seed(2025) 2025-05-07T20:32:35.3018542Z 2025-05-07T20:32:35.3018815Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3020950Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3022796Z 2025-05-07T20:32:35.3022914Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3023128Z 2025-05-07T20:32:35.3023235Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3023649Z self=, 2025-05-07T20:32:35.3024043Z T=16384, 2025-05-07T20:32:35.3024227Z D=7168, 2025-05-07T20:32:35.3024417Z scale_ub=None, 2025-05-07T20:32:35.3024639Z contiguous=True, 2025-05-07T20:32:35.3024929Z compiled=False, 2025-05-07T20:32:35.3025137Z ) 2025-05-07T20:32:35.3025448Z self = 2025-05-07T20:32:35.3025936Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.3026215Z 2025-05-07T20:32:35.3026295Z @given( 2025-05-07T20:32:35.3026522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3026831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3027132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3027464Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3027804Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3028083Z ) 2025-05-07T20:32:35.3028434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3028866Z def test_silu_mul_quant( 2025-05-07T20:32:35.3029101Z self, 2025-05-07T20:32:35.3029304Z T: int, 2025-05-07T20:32:35.3029509Z D: int, 2025-05-07T20:32:35.3029725Z scale_ub: Optional[float], 2025-05-07T20:32:35.3030000Z contiguous: bool, 2025-05-07T20:32:35.3030244Z compiled: bool, 2025-05-07T20:32:35.3030459Z ) -> None: 2025-05-07T20:32:35.3030685Z torch.manual_seed(2025) 2025-05-07T20:32:35.3030921Z 2025-05-07T20:32:35.3031189Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3033235Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3035120Z 2025-05-07T20:32:35.3035248Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3035457Z 2025-05-07T20:32:35.3035562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3035978Z self=, 2025-05-07T20:32:35.3036380Z T=16384, 2025-05-07T20:32:35.3036568Z D=7168, 2025-05-07T20:32:35.3036763Z scale_ub=1200.0, 2025-05-07T20:32:35.3036984Z contiguous=True, 2025-05-07T20:32:35.3037245Z compiled=False, 2025-05-07T20:32:35.3037445Z ) 2025-05-07T20:32:35.3037793Z self = 2025-05-07T20:32:35.3038285Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.3038562Z 2025-05-07T20:32:35.3038636Z @given( 2025-05-07T20:32:35.3038862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3039173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3039470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3039798Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3040122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3040400Z ) 2025-05-07T20:32:35.3040745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3041231Z def test_silu_mul_quant( 2025-05-07T20:32:35.3041468Z self, 2025-05-07T20:32:35.3041659Z T: int, 2025-05-07T20:32:35.3041853Z D: int, 2025-05-07T20:32:35.3042075Z scale_ub: Optional[float], 2025-05-07T20:32:35.3042343Z contiguous: bool, 2025-05-07T20:32:35.3042581Z compiled: bool, 2025-05-07T20:32:35.3042803Z ) -> None: 2025-05-07T20:32:35.3043013Z torch.manual_seed(2025) 2025-05-07T20:32:35.3043257Z 2025-05-07T20:32:35.3043528Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3045590Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
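The requested sizes line up with the cost of a [T, 2 * D] bfloat16 tensor: bfloat16 is 2 bytes per element, so the test's first allocation needs T * (2 * D) * 2 bytes. A minimal sketch (illustrative only, not part of the test suite) that reproduces the larger figures above from the sampled (T, D) grid:

    # Size of x = torch.randn([T, 2 * D], dtype=torch.bfloat16), in MiB.
    for T in (1, 128, 2048, 4096, 16384):
        for D in (5120, 7168):
            mib = T * (2 * D) * 2 / (1 << 20)
            print(f"T={T:6d} D={D}: {mib:8.2f} MiB")
    # T=4096, D=7168 -> 112.00 MiB; T=16384, D=7168 -> 448.00 MiB;
    # T=2048, D=5120 -> 40.00 MiB, matching the "Tried to allocate" sizes.
    # (The 20.00 MiB requests seen later for T=128 do not map one-to-one
    # onto this formula; those presumably include other allocations.)

Even the largest single request here is 448 MiB on a card with roughly 22 GiB of capacity, so the out-of-memory errors point at accumulated allocations rather than at any one oversized example.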
2025-05-07T20:32:35.3047880Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
This example got past the input allocations, and the failure moved into the op under test: y_fp8, y_scale = fn() (moe/activation_test.py:117) called silu_mul_quant(x0, x1, scale_ub_tensor) (moe/activation_test.py:115), which launched the Triton kernel:
2025-05-07T20:32:35.4474940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.4475633Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.4476159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.4476833Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.4477492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.4478018Z     kernel = self.compile(
2025-05-07T20:32:35.4478560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.4479212Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.4486694Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.4487041Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.4487339Z E   ^
2025-05-07T20:32:35.4487793Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.4488645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.4489254Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 56.00 MiB with 26.44 MiB free (21.74 GiB now allocated by PyTorch, 10.99 MiB reserved but unallocated), followed by the same PYTORCH_CUDA_ALLOC_CONF advice.
2025-05-07T20:32:35.4501598Z moe/activation_test.py:92: OutOfMemoryError
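This is the run's second failure mode, and it is environment-specific rather than memory-related: the ~22 GiB card in this job is consistent with an A10G-class GPU at compute capability 8.6, and in this Triton version the fp8e4nv (e4m3) dtype these kernels use needs a newer architecture (roughly sm_89 and up), leaving only fp8e4b15 and fp8e5 available here. A hedged sketch of a capability guard (the helper name and the 8.9 threshold are assumptions, not something the test suite does today):

    # Hypothetical guard: skip fp8 kernel tests on GPUs whose Triton
    # backend cannot compile fp8e4nv (e4m3). Threshold assumed to be sm_89.
    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # get_device_capability() returns (major, minor), e.g. (8, 6) here.
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(
        _supports_fp8e4nv(),
        "Triton fp8e4nv requires compute capability >= 8.9 on this toolchain",
    )
    class ActivationFp8Tests(unittest.TestCase):
        ...

A class-level skipUnless of this shape would turn these hard CompilationErrors into skips on 8.6-class runners while leaving newer GPUs unaffected.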
2025-05-07T20:32:35.4501926Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
The compiled variant of the same call hit the identical Triton error, with torch.compile adding one frame: fn() entered /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678 (in _fn, return fn(*args, **kwargs)) before reaching silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 and the _fbgemm_silu_mul_quant[grid] launch:
2025-05-07T20:32:35.4955685Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.4956042Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.4956300Z E   ^
2025-05-07T20:32:35.4956759Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.4957671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.4958290Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError, this time at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 20.00 MiB while GPU 0 had only 4.44 MiB free, with 21.77 GiB allocated by PyTorch and 6.37 MiB reserved but unallocated, followed by the same allocator advice.
2025-05-07T20:32:35.4971123Z moe/activation_test.py:95: OutOfMemoryError
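Two details in this stretch of the log suggest state leaking across examples: free memory fell from 26.44 MiB at the start of the run to 4.44 MiB here, and the OOM moved from the first allocation (line 92) to a later statement (line 95), so tensors from earlier examples are evidently still resident. A hedged mitigation sketch (the helper is hypothetical, not something the test currently does):

    # Hypothetical cleanup between Hypothesis examples on a ~22 GiB card.
    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references first
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver

Called at the top of test_silu_mul_quant, something like this would bound the footprint to one example's tensors; alternatively, exporting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in the job environment is exactly what the allocator message itself suggests for fragmentation.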
2025-05-07T20:32:35.4971441Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 20.00 MiB with 4.44 MiB free (21.77 GiB allocated by PyTorch, 3.87 MiB reserved but unallocated), followed by the same allocator advice.
2025-05-07T20:32:35.4984125Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:35.4984442Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
This example failed at the first allocation again:
2025-05-07T20:32:35.6987568Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:35.6989618Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6991778Z 2025-05-07T20:32:35.6991993Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6992211Z 2025-05-07T20:32:35.7002894Z FAILED 2025-05-07T20:32:35.7003177Z 2025-05-07T20:32:35.7003508Z =================================== FAILURES =================================== 2025-05-07T20:32:35.7004152Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:35.7004763Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:35.7005647Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:35.7006524Z | yield 2025-05-07T20:32:35.7007136Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:35.7007881Z | self._callTestMethod(testMethod) 2025-05-07T20:32:35.7008660Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:35.7009406Z | method() 2025-05-07T20:32:35.7010080Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:35.7010802Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7011560Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:35.7012347Z | raise the_error_hypothesis_found 2025-05-07T20:32:35.7013033Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:35.7013753Z +-+---------------- 1 ---------------- 2025-05-07T20:32:35.7014167Z | Traceback (most recent call last): 2025-05-07T20:32:35.7015144Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.7016436Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7019294Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.7022465Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.7023093Z | self=, 2025-05-07T20:32:35.7023648Z | T=2048, 2025-05-07T20:32:35.7023971Z | D=5120, # or any other generated value 2025-05-07T20:32:35.7024451Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:35.7024937Z | contiguous=True, # or any other generated value 2025-05-07T20:32:35.7025434Z | compiled=False, # or any other generated value 2025-05-07T20:32:35.7025842Z | ) 2025-05-07T20:32:35.7026090Z | 2025-05-07T20:32:35.7026815Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:35.7027641Z +---------------- 2 ---------------- 2025-05-07T20:32:35.7028049Z | Traceback (most recent call last): 2025-05-07T20:32:35.7029023Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.7030087Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7032923Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.7035664Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.7036300Z | self=, 2025-05-07T20:32:35.7036854Z | T=128, 2025-05-07T20:32:35.7037141Z | D=7168, 2025-05-07T20:32:35.7037440Z | scale_ub=None, 2025-05-07T20:32:35.7037771Z | contiguous=True, 2025-05-07T20:32:35.7038106Z | compiled=True, 2025-05-07T20:32:35.7038408Z | ) 2025-05-07T20:32:35.7038655Z | 2025-05-07T20:32:35.7039375Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.7040193Z +---------------- 3 ---------------- 2025-05-07T20:32:35.7040587Z | Traceback (most recent call last): 2025-05-07T20:32:35.7041464Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.7042235Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7044250Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.7046252Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.7046680Z | self=, 2025-05-07T20:32:35.7047080Z | T=128, 2025-05-07T20:32:35.7047274Z | D=5120, 2025-05-07T20:32:35.7047481Z | scale_ub=1200.0, 2025-05-07T20:32:35.7047717Z | contiguous=True, 2025-05-07T20:32:35.7048004Z | compiled=True, 2025-05-07T20:32:35.7048218Z | ) 2025-05-07T20:32:35.7048392Z | 2025-05-07T20:32:35.7048954Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.7049557Z +---------------- 4 ---------------- 2025-05-07T20:32:35.7049840Z | Traceback (most recent call last): 2025-05-07T20:32:35.7050544Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:35.7051301Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.7051945Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:35.7052630Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7053653Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:35.7054799Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7055659Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:35.7056711Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7057921Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:35.7059059Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7065346Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:35.7066508Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7067596Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:35.7068553Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7069456Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:35.7070247Z | fn() 2025-05-07T20:32:35.7071034Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:35.7071896Z | self.fn.run( 2025-05-07T20:32:35.7072628Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:35.7073420Z | kernel = self.compile( 2025-05-07T20:32:35.7074253Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:35.7075231Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7076180Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:35.7077281Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7078126Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7078613Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7078978Z | ^ 2025-05-07T20:32:35.7079602Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7080384Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.7080945Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:35.7081761Z | self=, 2025-05-07T20:32:35.7082405Z | T=1, # or any other generated value 2025-05-07T20:32:35.7082841Z | D=5120, # or any other generated value 2025-05-07T20:32:35.7083308Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:35.7083806Z | contiguous=True, # or any other generated value 2025-05-07T20:32:35.7084320Z | compiled=True, # or any other generated value 2025-05-07T20:32:35.7084732Z | ) 2025-05-07T20:32:35.7084985Z | 2025-05-07T20:32:35.7085696Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.7086516Z +------------------------------------ 2025-05-07T20:32:35.7086997Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:35.7087503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7088052Z self=, 2025-05-07T20:32:35.7088585Z T=1, 2025-05-07T20:32:35.7088834Z D=5120, 2025-05-07T20:32:35.7089099Z scale_ub=None, 2025-05-07T20:32:35.7089396Z contiguous=True, 2025-05-07T20:32:35.7089687Z compiled=True, 2025-05-07T20:32:35.7090270Z ) 2025-05-07T20:32:35.7090707Z self = 2025-05-07T20:32:35.7091524Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7091885Z 2025-05-07T20:32:35.7091993Z @given( 2025-05-07T20:32:35.7092310Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7092730Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7093136Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7093576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7094027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7094431Z ) 2025-05-07T20:32:35.7094914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7095529Z def test_silu_mul_quant( 2025-05-07T20:32:35.7095865Z self, 2025-05-07T20:32:35.7096577Z T: int, 2025-05-07T20:32:35.7096847Z D: int, 2025-05-07T20:32:35.7097138Z scale_ub: Optional[float], 2025-05-07T20:32:35.7097513Z contiguous: bool, 2025-05-07T20:32:35.7097867Z compiled: bool, 2025-05-07T20:32:35.7098189Z ) -> None: 2025-05-07T20:32:35.7098488Z torch.manual_seed(2025) 2025-05-07T20:32:35.7098836Z 2025-05-07T20:32:35.7099214Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7099681Z 2025-05-07T20:32:35.7100082Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7100489Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7100921Z x = x_sign * x_clamp 2025-05-07T20:32:35.7101259Z x0 = x[:, :D] 2025-05-07T20:32:35.7101565Z x1 = x[:, D:] 2025-05-07T20:32:35.7101847Z 2025-05-07T20:32:35.7102104Z if contiguous: 2025-05-07T20:32:35.7102425Z x0 = x0.contiguous() 
2025-05-07T20:32:35.7102770Z x1 = x1.contiguous() 2025-05-07T20:32:35.7103096Z 2025-05-07T20:32:35.7103358Z if scale_ub is not None: 2025-05-07T20:32:35.7103732Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7104283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7104707Z ) 2025-05-07T20:32:35.7104976Z else: 2025-05-07T20:32:35.7105266Z scale_ub_tensor = None 2025-05-07T20:32:35.7105616Z 2025-05-07T20:32:35.7105936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7106362Z op = silu_mul_quant 2025-05-07T20:32:35.7106711Z if compiled: 2025-05-07T20:32:35.7107689Z op = torch.compile(op) 2025-05-07T20:32:35.7108100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7108576Z 2025-05-07T20:32:35.7108836Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7109286Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7109680Z 2025-05-07T20:32:35.7109999Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7110427Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7110813Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7111274Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7111732Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7112128Z 2025-05-07T20:32:35.7112388Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.7112641Z 2025-05-07T20:32:35.7112775Z moe/activation_test.py:126: 2025-05-07T20:32:35.7113155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7113588Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7114017Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7135386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7136451Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7137182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7138179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7139096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7140261Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7141284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7142324Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7143338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7144221Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7145028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7145741Z fn() 2025-05-07T20:32:35.7146458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7147251Z self.fn.run( 2025-05-07T20:32:35.7147932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7148664Z kernel = self.compile( 2025-05-07T20:32:35.7149410Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7150297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7150845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7151152Z 2025-05-07T20:32:35.7151439Z self = 2025-05-07T20:32:35.7152882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7154782Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09d57caf0>} 2025-05-07T20:32:35.7156594Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7158163Z context = 2025-05-07T20:32:35.7158555Z 2025-05-07T20:32:35.7158787Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7159487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7160105Z module_map=module_map) 2025-05-07T20:32:35.7160579Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7161039Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7161379Z E ^ 2025-05-07T20:32:35.7161988Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7162589Z 2025-05-07T20:32:35.7163158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7163835Z 2025-05-07T20:32:35.7163977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7164505Z self=, 2025-05-07T20:32:35.7165025Z T=2048, 2025-05-07T20:32:35.7165269Z D=5120, 2025-05-07T20:32:35.7165509Z scale_ub=1200.0, 2025-05-07T20:32:35.7165797Z contiguous=True, 2025-05-07T20:32:35.7166080Z compiled=False, 2025-05-07T20:32:35.7166357Z ) 2025-05-07T20:32:35.7166831Z self = 2025-05-07T20:32:35.7167468Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.7167810Z 2025-05-07T20:32:35.7167913Z @given( 2025-05-07T20:32:35.7168202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7168603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7169003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7169422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7169857Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7170236Z ) 2025-05-07T20:32:35.7170682Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7171251Z def test_silu_mul_quant( 2025-05-07T20:32:35.7171564Z self, 2025-05-07T20:32:35.7171804Z T: int, 2025-05-07T20:32:35.7172065Z D: int, 2025-05-07T20:32:35.7172351Z scale_ub: Optional[float], 2025-05-07T20:32:35.7172701Z contiguous: bool, 2025-05-07T20:32:35.7173013Z compiled: bool, 2025-05-07T20:32:35.7173323Z ) -> None: 2025-05-07T20:32:35.7173626Z torch.manual_seed(2025) 2025-05-07T20:32:35.7173958Z 2025-05-07T20:32:35.7174308Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7174750Z 2025-05-07T20:32:35.7174989Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7175370Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7175772Z x = x_sign * x_clamp 2025-05-07T20:32:35.7176076Z x0 = x[:, :D] 
2025-05-07T20:32:35.7176367Z x1 = x[:, D:] 2025-05-07T20:32:35.7176644Z 2025-05-07T20:32:35.7176878Z if contiguous: 2025-05-07T20:32:35.7177182Z x0 = x0.contiguous() 2025-05-07T20:32:35.7177513Z x1 = x1.contiguous() 2025-05-07T20:32:35.7177830Z 2025-05-07T20:32:35.7178138Z if scale_ub is not None: 2025-05-07T20:32:35.7178486Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7178916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7179310Z ) 2025-05-07T20:32:35.7179552Z else: 2025-05-07T20:32:35.7179967Z scale_ub_tensor = None 2025-05-07T20:32:35.7180302Z 2025-05-07T20:32:35.7180595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7181028Z op = silu_mul_quant 2025-05-07T20:32:35.7181378Z if compiled: 2025-05-07T20:32:35.7181751Z op = torch.compile(op) 2025-05-07T20:32:35.7182137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7182538Z 2025-05-07T20:32:35.7182789Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7182999Z 2025-05-07T20:32:35.7183125Z moe/activation_test.py:117: 2025-05-07T20:32:35.7183516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7183952Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7184309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7185214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7186144Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7186865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7187777Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7188682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7189401Z kernel = self.compile( 2025-05-07T20:32:35.7190480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7191386Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7192079Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7192396Z 2025-05-07T20:32:35.7192682Z self = 2025-05-07T20:32:35.7194122Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7195988Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09d45d990>} 2025-05-07T20:32:35.7197867Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7199266Z context = 2025-05-07T20:32:35.7199647Z 2025-05-07T20:32:35.7199866Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7200564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7201198Z module_map=module_map) 2025-05-07T20:32:35.7201681Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7202142Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7202491Z E ^ 2025-05-07T20:32:35.7203103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7203717Z 2025-05-07T20:32:35.7204308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7205037Z 2025-05-07T20:32:35.7205179Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7205839Z self=, 2025-05-07T20:32:35.7206402Z T=2048, 2025-05-07T20:32:35.7206655Z D=5120, 2025-05-07T20:32:35.7206918Z scale_ub=1200.0, 2025-05-07T20:32:35.7207232Z contiguous=True, 2025-05-07T20:32:35.7207534Z compiled=True, 2025-05-07T20:32:35.7207813Z ) 2025-05-07T20:32:35.7208245Z self = 2025-05-07T20:32:35.7208917Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.7209386Z 2025-05-07T20:32:35.7209491Z @given( 2025-05-07T20:32:35.7209805Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7210302Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7210725Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7211217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7211680Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7212080Z ) 2025-05-07T20:32:35.7212562Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7213160Z def test_silu_mul_quant( 2025-05-07T20:32:35.7213501Z self, 2025-05-07T20:32:35.7213753Z T: int, 2025-05-07T20:32:35.7214017Z D: int, 2025-05-07T20:32:35.7214313Z scale_ub: Optional[float], 2025-05-07T20:32:35.7214675Z contiguous: bool, 2025-05-07T20:32:35.7214998Z compiled: bool, 2025-05-07T20:32:35.7215297Z ) -> None: 2025-05-07T20:32:35.7215601Z torch.manual_seed(2025) 2025-05-07T20:32:35.7215915Z 2025-05-07T20:32:35.7216270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7216697Z 2025-05-07T20:32:35.7216955Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7217315Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7217733Z x = x_sign * x_clamp 2025-05-07T20:32:35.7218082Z x0 = x[:, :D] 2025-05-07T20:32:35.7218450Z x1 = x[:, D:] 2025-05-07T20:32:35.7218740Z 2025-05-07T20:32:35.7218996Z if contiguous: 2025-05-07T20:32:35.7219318Z x0 = x0.contiguous() 2025-05-07T20:32:35.7219670Z x1 = x1.contiguous() 2025-05-07T20:32:35.7220149Z 2025-05-07T20:32:35.7220426Z if scale_ub is not None: 2025-05-07T20:32:35.7220800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7221158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7221472Z ) 2025-05-07T20:32:35.7221669Z else: 2025-05-07T20:32:35.7221877Z scale_ub_tensor = None 2025-05-07T20:32:35.7222135Z 2025-05-07T20:32:35.7222374Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7222684Z op = silu_mul_quant 2025-05-07T20:32:35.7222940Z if compiled: 2025-05-07T20:32:35.7223193Z op = torch.compile(op) 2025-05-07T20:32:35.7223490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7223768Z 2025-05-07T20:32:35.7223967Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7224255Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7224547Z 2025-05-07T20:32:35.7224787Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7225116Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7225413Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7225730Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7226096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7226406Z 2025-05-07T20:32:35.7226618Z > 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd097e2d3f0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
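The ref_fn path above spells out the numerics under test: upcast both halves to fp32, apply SiLU gating (x0 * sigmoid(x0) * x1), then quantize each row to FP8 with one scale per row, which the test undoes via y_fp8.to(torch.float32) * y_scale[:, None]. For readers without the FBGEMM/Triton kernels, a plain-PyTorch approximation of that contract is sketched below. The E4M3 maximum of 448 and the reading of scale_ub as a cap on the per-row maximum are assumptions about triton_quantize_fp8_row's semantics, not verified behavior.

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

    def rowwise_quantize_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so that y / scale fits the E4M3 range.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed outlier cap
        scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0, x1 = torch.randn(2, 16, 128).unbind(0)
    y = x0 * torch.sigmoid(x0) * x1              # SiLU(x0) * x1
    y_fp8, y_scale = rowwise_quantize_fp8(y)
    y_round_trip = y_fp8.to(torch.float32) * y_scale[:, None]  # as in the test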
2025-05-07T20:32:35.7264024Z op = torch.compile(op) 2025-05-07T20:32:35.7264316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7264593Z 2025-05-07T20:32:35.7264787Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7264950Z 2025-05-07T20:32:35.7265051Z moe/activation_test.py:117: 2025-05-07T20:32:35.7265345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7265678Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7265956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7266644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7267334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7267870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7268542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7269262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7269799Z kernel = self.compile( 2025-05-07T20:32:35.7270336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7270994Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7271425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7271694Z 2025-05-07T20:32:35.7271948Z self = 2025-05-07T20:32:35.7273019Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7274402Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd097e2ce50>} 2025-05-07T20:32:35.7275733Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7276748Z context = 2025-05-07T20:32:35.7277038Z 2025-05-07T20:32:35.7277215Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7277730Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7278196Z module_map=module_map) 2025-05-07T20:32:35.7278561Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7278909Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7279169Z E ^ 2025-05-07T20:32:35.7279673Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7280128Z 2025-05-07T20:32:35.7280545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7281051Z 2025-05-07T20:32:35.7281157Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7281566Z self=, 2025-05-07T20:32:35.7281967Z T=1, 2025-05-07T20:32:35.7282154Z D=7168, 2025-05-07T20:32:35.7282344Z scale_ub=None, 2025-05-07T20:32:35.7282563Z contiguous=True, 2025-05-07T20:32:35.7282787Z compiled=True, 2025-05-07T20:32:35.7282985Z ) 2025-05-07T20:32:35.7283305Z self = 2025-05-07T20:32:35.7283784Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7284042Z 2025-05-07T20:32:35.7284122Z @given( 2025-05-07T20:32:35.7284354Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7284667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7284967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7285293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7285619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7285902Z ) 2025-05-07T20:32:35.7286245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7286684Z def test_silu_mul_quant( 2025-05-07T20:32:35.7286924Z self, 2025-05-07T20:32:35.7287114Z T: int, 2025-05-07T20:32:35.7287311Z D: int, 2025-05-07T20:32:35.7287531Z scale_ub: Optional[float], 2025-05-07T20:32:35.7287795Z contiguous: bool, 2025-05-07T20:32:35.7288038Z compiled: bool, 2025-05-07T20:32:35.7288311Z ) -> None: 2025-05-07T20:32:35.7288526Z torch.manual_seed(2025) 2025-05-07T20:32:35.7288769Z 2025-05-07T20:32:35.7289039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7289371Z 2025-05-07T20:32:35.7289564Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7290100Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7290473Z x = x_sign * x_clamp 2025-05-07T20:32:35.7290715Z x0 = x[:, :D] 2025-05-07T20:32:35.7290932Z x1 = x[:, D:] 2025-05-07T20:32:35.7291135Z 2025-05-07T20:32:35.7291444Z if contiguous: 2025-05-07T20:32:35.7291684Z x0 = x0.contiguous() 2025-05-07T20:32:35.7292001Z x1 = x1.contiguous() 2025-05-07T20:32:35.7292245Z 2025-05-07T20:32:35.7292439Z if scale_ub is not None: 2025-05-07T20:32:35.7292715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7293043Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7293361Z ) 2025-05-07T20:32:35.7293558Z else: 2025-05-07T20:32:35.7293769Z scale_ub_tensor = None 2025-05-07T20:32:35.7294021Z 2025-05-07T20:32:35.7294254Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7294559Z op = silu_mul_quant 2025-05-07T20:32:35.7294811Z if compiled: 2025-05-07T20:32:35.7295060Z op = torch.compile(op) 2025-05-07T20:32:35.7295348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7295620Z 2025-05-07T20:32:35.7295821Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7296101Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7296394Z 2025-05-07T20:32:35.7296638Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7296976Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7297264Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7297576Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7298004Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7298310Z 2025-05-07T20:32:35.7298515Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.7298707Z 2025-05-07T20:32:35.7298814Z moe/activation_test.py:126: 2025-05-07T20:32:35.7299111Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7299443Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7299877Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7300676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7301420Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7301962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7302641Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7303325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7304035Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7304783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7305524Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7306242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7306879Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7307474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7307989Z fn() 2025-05-07T20:32:35.7308596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7309183Z self.fn.run( 2025-05-07T20:32:35.7309649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7310174Z kernel = self.compile( 2025-05-07T20:32:35.7310712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7311363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7311801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7312064Z 2025-05-07T20:32:35.7319541Z self = 2025-05-07T20:32:35.7320660Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7322033Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd097bc5000>} 2025-05-07T20:32:35.7323382Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7324403Z context = 2025-05-07T20:32:35.7324690Z 2025-05-07T20:32:35.7324865Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7325386Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7325857Z module_map=module_map) 2025-05-07T20:32:35.7326307Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7326666Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7326936Z E ^ 2025-05-07T20:32:35.7327404Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7327847Z 2025-05-07T20:32:35.7328268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7328774Z 2025-05-07T20:32:35.7328878Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7329297Z self=, 2025-05-07T20:32:35.7329696Z T=4096, 2025-05-07T20:32:35.7329886Z D=5120, 2025-05-07T20:32:35.7330081Z scale_ub=None, 2025-05-07T20:32:35.7330300Z contiguous=False, 2025-05-07T20:32:35.7330528Z compiled=False, 2025-05-07T20:32:35.7330734Z ) 2025-05-07T20:32:35.7331084Z self = 2025-05-07T20:32:35.7331609Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.7331881Z 2025-05-07T20:32:35.7331961Z @given( 2025-05-07T20:32:35.7332197Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7332517Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7332820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7333150Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7333482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7333762Z ) 2025-05-07T20:32:35.7334118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7334559Z def test_silu_mul_quant( 2025-05-07T20:32:35.7334804Z self, 2025-05-07T20:32:35.7334998Z T: int, 2025-05-07T20:32:35.7335202Z D: int, 2025-05-07T20:32:35.7335427Z scale_ub: Optional[float], 2025-05-07T20:32:35.7335752Z contiguous: bool, 2025-05-07T20:32:35.7336000Z compiled: bool, 2025-05-07T20:32:35.7336235Z ) -> None: 2025-05-07T20:32:35.7336454Z torch.manual_seed(2025) 2025-05-07T20:32:35.7336703Z 2025-05-07T20:32:35.7336982Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7337316Z 2025-05-07T20:32:35.7337514Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7337813Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7338117Z x = x_sign * x_clamp 2025-05-07T20:32:35.7338412Z x0 = x[:, :D] 2025-05-07T20:32:35.7338632Z x1 = x[:, D:] 2025-05-07T20:32:35.7338836Z 2025-05-07T20:32:35.7339068Z if contiguous: 2025-05-07T20:32:35.7339310Z x0 = x0.contiguous() 2025-05-07T20:32:35.7339566Z x1 = x1.contiguous() 2025-05-07T20:32:35.7340547Z 2025-05-07T20:32:35.7340747Z if scale_ub is not None: 2025-05-07T20:32:35.7341024Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7341361Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7341668Z ) 2025-05-07T20:32:35.7341864Z else: 2025-05-07T20:32:35.7342074Z scale_ub_tensor = None 2025-05-07T20:32:35.7342323Z 2025-05-07T20:32:35.7342557Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7342866Z op = silu_mul_quant 2025-05-07T20:32:35.7343120Z if compiled: 
2025-05-07T20:32:35.7343368Z op = torch.compile(op) 2025-05-07T20:32:35.7343673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7343942Z 2025-05-07T20:32:35.7344139Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7344306Z 2025-05-07T20:32:35.7344412Z moe/activation_test.py:117: 2025-05-07T20:32:35.7344702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7345034Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7345319Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7346054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7346744Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7347286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7347961Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7348617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7349157Z kernel = self.compile( 2025-05-07T20:32:35.7349698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7350343Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7350732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7350961Z 2025-05-07T20:32:35.7351164Z self = 2025-05-07T20:32:35.7352224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7353578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd097bc5a20>} 2025-05-07T20:32:35.7354922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7355926Z context = 2025-05-07T20:32:35.7356260Z 2025-05-07T20:32:35.7356432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7356948Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7357399Z module_map=module_map) 2025-05-07T20:32:35.7357762Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7358108Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7358360Z E ^ 2025-05-07T20:32:35.7358812Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7359309Z 2025-05-07T20:32:35.7359754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7360266Z 2025-05-07T20:32:35.7360373Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7360777Z self=, 2025-05-07T20:32:35.7361175Z T=4096, 2025-05-07T20:32:35.7361363Z D=7168, 2025-05-07T20:32:35.7361550Z scale_ub=None, 2025-05-07T20:32:35.7361756Z contiguous=False, 2025-05-07T20:32:35.7361978Z compiled=False, 2025-05-07T20:32:35.7362176Z ) 2025-05-07T20:32:35.7362483Z self = 2025-05-07T20:32:35.7362971Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.7363241Z 2025-05-07T20:32:35.7363322Z @given( 2025-05-07T20:32:35.7363549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7363854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7364157Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7364489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7364803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7365081Z ) 2025-05-07T20:32:35.7365466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7365907Z def test_silu_mul_quant( 2025-05-07T20:32:35.7366151Z self, 2025-05-07T20:32:35.7366340Z T: int, 2025-05-07T20:32:35.7366528Z D: int, 2025-05-07T20:32:35.7366739Z scale_ub: Optional[float], 2025-05-07T20:32:35.7367005Z contiguous: bool, 2025-05-07T20:32:35.7367235Z compiled: bool, 2025-05-07T20:32:35.7367449Z ) -> None: 2025-05-07T20:32:35.7367663Z torch.manual_seed(2025) 2025-05-07T20:32:35.7367895Z 2025-05-07T20:32:35.7368164Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7368497Z 2025-05-07T20:32:35.7368684Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7368971Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7369272Z x = x_sign * x_clamp 2025-05-07T20:32:35.7369505Z x0 = x[:, :D] 2025-05-07T20:32:35.7369708Z x1 = x[:, D:] 2025-05-07T20:32:35.7369915Z 2025-05-07T20:32:35.7370100Z if contiguous: 2025-05-07T20:32:35.7370325Z x0 = x0.contiguous() 2025-05-07T20:32:35.7370575Z x1 = x1.contiguous() 2025-05-07T20:32:35.7370805Z 2025-05-07T20:32:35.7370987Z if scale_ub is not None: 2025-05-07T20:32:35.7371255Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7371582Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7371874Z ) 2025-05-07T20:32:35.7372063Z else: 2025-05-07T20:32:35.7372272Z scale_ub_tensor = None 2025-05-07T20:32:35.7372516Z 2025-05-07T20:32:35.7372742Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7373051Z op = silu_mul_quant 2025-05-07T20:32:35.7373290Z if compiled: 2025-05-07T20:32:35.7373532Z op = torch.compile(op) 2025-05-07T20:32:35.7373822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7374143Z 2025-05-07T20:32:35.7374330Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7374499Z 2025-05-07T20:32:35.7374596Z moe/activation_test.py:117: 2025-05-07T20:32:35.7374888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7375211Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7375491Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7376165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7376894Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7377462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7378143Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7378798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7379317Z kernel = self.compile( 2025-05-07T20:32:35.7379978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7380632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7381024Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7381246Z 2025-05-07T20:32:35.7381446Z self = 2025-05-07T20:32:35.7382505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7383875Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd097bc6560>} 2025-05-07T20:32:35.7385268Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7386279Z context = 2025-05-07T20:32:35.7386565Z 2025-05-07T20:32:35.7386729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7387239Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7387696Z module_map=module_map) 2025-05-07T20:32:35.7388053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7388406Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7388658Z E ^ 2025-05-07T20:32:35.7389109Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7389558Z 2025-05-07T20:32:35.7390252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7390789Z 2025-05-07T20:32:35.7390894Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7391299Z self=, 2025-05-07T20:32:35.7391686Z T=128, 2025-05-07T20:32:35.7391869Z D=7168, 2025-05-07T20:32:35.7392060Z scale_ub=None, 2025-05-07T20:32:35.7392271Z contiguous=False, 2025-05-07T20:32:35.7392491Z compiled=True, 2025-05-07T20:32:35.7392697Z ) 2025-05-07T20:32:35.7393006Z self = 2025-05-07T20:32:35.7393496Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7393764Z 2025-05-07T20:32:35.7393841Z @given( 2025-05-07T20:32:35.7394069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7394378Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7394783Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7395112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7395436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7395708Z ) 2025-05-07T20:32:35.7396055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7396492Z def test_silu_mul_quant( 2025-05-07T20:32:35.7396724Z self, 2025-05-07T20:32:35.7396913Z T: int, 2025-05-07T20:32:35.7397176Z D: int, 2025-05-07T20:32:35.7397388Z scale_ub: Optional[float], 2025-05-07T20:32:35.7397659Z contiguous: bool, 2025-05-07T20:32:35.7397948Z compiled: bool, 2025-05-07T20:32:35.7398167Z ) -> None: 2025-05-07T20:32:35.7398382Z torch.manual_seed(2025) 2025-05-07T20:32:35.7398620Z 2025-05-07T20:32:35.7398881Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7399218Z 2025-05-07T20:32:35.7399411Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7399698Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7399998Z x = x_sign * x_clamp 2025-05-07T20:32:35.7400237Z x0 = x[:, :D] 2025-05-07T20:32:35.7400450Z x1 = x[:, D:] 2025-05-07T20:32:35.7400646Z 2025-05-07T20:32:35.7400828Z if contiguous: 2025-05-07T20:32:35.7401058Z x0 = x0.contiguous() 2025-05-07T20:32:35.7401308Z x1 = x1.contiguous() 2025-05-07T20:32:35.7401542Z 2025-05-07T20:32:35.7401735Z if scale_ub is not None: 2025-05-07T20:32:35.7401999Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7402329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7402627Z ) 2025-05-07T20:32:35.7402809Z else: 2025-05-07T20:32:35.7403019Z scale_ub_tensor = None 2025-05-07T20:32:35.7403265Z 2025-05-07T20:32:35.7403557Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7403869Z op = silu_mul_quant 2025-05-07T20:32:35.7404117Z if compiled: 2025-05-07T20:32:35.7404362Z op = torch.compile(op) 2025-05-07T20:32:35.7404649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7404916Z 2025-05-07T20:32:35.7405109Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7405384Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7405667Z 2025-05-07T20:32:35.7405904Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7406232Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7406526Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7406840Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7407187Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7407493Z 2025-05-07T20:32:35.7407693Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.7407887Z 2025-05-07T20:32:35.7407992Z moe/activation_test.py:126: 2025-05-07T20:32:35.7408280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7408609Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7408931Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7409700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7409807Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7410171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7410393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7410754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7411060Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7411460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7411708Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7412083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7412246Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7412666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7412748Z fn() 2025-05-07T20:32:35.7413143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7413223Z self.fn.run( 2025-05-07T20:32:35.7413568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7413660Z kernel = self.compile( 2025-05-07T20:32:35.7414039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7414212Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7414338Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7414343Z 2025-05-07T20:32:35.7414547Z self = 2025-05-07T20:32:35.7415333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7415869Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd097bca680>} 2025-05-07T20:32:35.7416619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7416809Z context = 2025-05-07T20:32:35.7416813Z 2025-05-07T20:32:35.7416982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7417246Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7417362Z module_map=module_map) 2025-05-07T20:32:35.7417522Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7417621Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7417701Z E ^ 2025-05-07T20:32:35.7418058Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7418065Z 2025-05-07T20:32:35.7418479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7418489Z 2025-05-07T20:32:35.7418593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7418812Z self=, 2025-05-07T20:32:35.7418890Z T=128, 2025-05-07T20:32:35.7418965Z D=7168, 2025-05-07T20:32:35.7419050Z scale_ub=None, 2025-05-07T20:32:35.7419141Z contiguous=False, 2025-05-07T20:32:35.7419222Z compiled=False, 2025-05-07T20:32:35.7419290Z ) 2025-05-07T20:32:35.7419517Z self = 2025-05-07T20:32:35.7419688Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.7419693Z 2025-05-07T20:32:35.7419887Z @given( 2025-05-07T20:32:35.7420069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7420166Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7420285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7420401Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7420515Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7420592Z ) 2025-05-07T20:32:35.7420834Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7420926Z def test_silu_mul_quant( 2025-05-07T20:32:35.7421048Z self, 2025-05-07T20:32:35.7421125Z T: int, 2025-05-07T20:32:35.7421200Z D: int, 2025-05-07T20:32:35.7421339Z scale_ub: Optional[float], 2025-05-07T20:32:35.7421429Z contiguous: bool, 2025-05-07T20:32:35.7421519Z compiled: bool, 2025-05-07T20:32:35.7421595Z ) -> None: 2025-05-07T20:32:35.7421687Z torch.manual_seed(2025) 2025-05-07T20:32:35.7421760Z 2025-05-07T20:32:35.7421927Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7421997Z 2025-05-07T20:32:35.7422093Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7422216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7422300Z x = x_sign * x_clamp 2025-05-07T20:32:35.7422384Z x0 = x[:, :D] 2025-05-07T20:32:35.7422461Z x1 = x[:, D:] 2025-05-07T20:32:35.7422531Z 2025-05-07T20:32:35.7422619Z if contiguous: 2025-05-07T20:32:35.7422710Z x0 = x0.contiguous() 2025-05-07T20:32:35.7422806Z x1 = x1.contiguous() 2025-05-07T20:32:35.7422877Z 2025-05-07T20:32:35.7422968Z if scale_ub is not None: 2025-05-07T20:32:35.7423080Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7423215Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7423288Z ) 2025-05-07T20:32:35.7423364Z else: 2025-05-07T20:32:35.7423506Z scale_ub_tensor = None 2025-05-07T20:32:35.7423581Z 2025-05-07T20:32:35.7423712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7423804Z op = silu_mul_quant 2025-05-07T20:32:35.7423887Z if compiled: 
2025-05-07T20:32:35.7423993Z op = torch.compile(op) 2025-05-07T20:32:35.7424098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7424170Z 2025-05-07T20:32:35.7424264Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7424268Z 2025-05-07T20:32:35.7424364Z moe/activation_test.py:117: 2025-05-07T20:32:35.7424499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7424602Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7424699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7425197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7425297Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7425659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7425877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7426211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7426308Z kernel = self.compile( 2025-05-07T20:32:35.7426684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7426860Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7426991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7426996Z 2025-05-07T20:32:35.7427197Z self = 2025-05-07T20:32:35.7427979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7428521Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd097c25f30>} 2025-05-07T20:32:35.7429271Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7429558Z context = 2025-05-07T20:32:35.7429563Z 2025-05-07T20:32:35.7429729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7429991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7430103Z module_map=module_map) 2025-05-07T20:32:35.7430262Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7430365Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7430437Z E ^ 2025-05-07T20:32:35.7430790Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7430795Z 2025-05-07T20:32:35.7431203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7431211Z 2025-05-07T20:32:35.7431312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7431538Z self=, 2025-05-07T20:32:35.7431610Z T=4096, 2025-05-07T20:32:35.7431684Z D=5120, 2025-05-07T20:32:35.7431764Z scale_ub=1200.0, 2025-05-07T20:32:35.7431844Z contiguous=True, 2025-05-07T20:32:35.7431931Z compiled=False, 2025-05-07T20:32:35.7432043Z ) 2025-05-07T20:32:35.7432259Z self = 2025-05-07T20:32:35.7432435Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.7432439Z 2025-05-07T20:32:35.7432515Z @given( 2025-05-07T20:32:35.7432630Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7432734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7432849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7432972Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7433087Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7433160Z ) 2025-05-07T20:32:35.7433412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7433504Z def test_silu_mul_quant( 2025-05-07T20:32:35.7433576Z self, 2025-05-07T20:32:35.7433660Z T: int, 2025-05-07T20:32:35.7433739Z D: int, 2025-05-07T20:32:35.7433838Z scale_ub: Optional[float], 2025-05-07T20:32:35.7433934Z contiguous: bool, 2025-05-07T20:32:35.7434020Z compiled: bool, 2025-05-07T20:32:35.7434097Z ) -> None: 2025-05-07T20:32:35.7434193Z torch.manual_seed(2025) 2025-05-07T20:32:35.7434260Z 2025-05-07T20:32:35.7434432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7434504Z 2025-05-07T20:32:35.7434594Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7434723Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7434810Z x = x_sign * x_clamp 2025-05-07T20:32:35.7434889Z x0 = x[:, :D] 2025-05-07T20:32:35.7434971Z x1 = x[:, D:] 2025-05-07T20:32:35.7435044Z 2025-05-07T20:32:35.7435127Z if contiguous: 2025-05-07T20:32:35.7435225Z x0 = x0.contiguous() 2025-05-07T20:32:35.7435311Z x1 = x1.contiguous() 2025-05-07T20:32:35.7435424Z 2025-05-07T20:32:35.7435524Z if scale_ub is not None: 2025-05-07T20:32:35.7435628Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7435763Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7435840Z ) 2025-05-07T20:32:35.7435913Z else: 2025-05-07T20:32:35.7436012Z scale_ub_tensor = None 2025-05-07T20:32:35.7436081Z 2025-05-07T20:32:35.7436213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7436312Z op = silu_mul_quant 2025-05-07T20:32:35.7436439Z if compiled: 2025-05-07T20:32:35.7436538Z op = torch.compile(op) 2025-05-07T20:32:35.7436686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7436756Z 2025-05-07T20:32:35.7436847Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7436851Z 2025-05-07T20:32:35.7436955Z moe/activation_test.py:117: 2025-05-07T20:32:35.7437082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7437190Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7437289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7437784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7437887Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7438247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7438469Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7438822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7438915Z kernel = self.compile( 2025-05-07T20:32:35.7439300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7439517Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7439641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7439645Z 2025-05-07T20:32:35.7439855Z self = 2025-05-07T20:32:35.7440618Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7441123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd097c25b40>} 2025-05-07T20:32:35.7441868Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7442058Z context = 2025-05-07T20:32:35.7442068Z 2025-05-07T20:32:35.7442233Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7442493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7442604Z module_map=module_map) 2025-05-07T20:32:35.7442763Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7442860Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7442943Z E ^ 2025-05-07T20:32:35.7443294Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7443298Z 2025-05-07T20:32:35.7443719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7443723Z 2025-05-07T20:32:35.7443871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7444095Z self=, 2025-05-07T20:32:35.7444175Z T=1, 2025-05-07T20:32:35.7444248Z D=5120, 2025-05-07T20:32:35.7444328Z scale_ub=None, 2025-05-07T20:32:35.7444416Z contiguous=True, 2025-05-07T20:32:35.7444495Z compiled=True, 2025-05-07T20:32:35.7444563Z ) 2025-05-07T20:32:35.7444788Z self = 2025-05-07T20:32:35.7444946Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7445023Z 2025-05-07T20:32:35.7445096Z @given( 2025-05-07T20:32:35.7445254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7445351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7445468Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7445581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7445698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7445773Z ) 2025-05-07T20:32:35.7446022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7446118Z def test_silu_mul_quant( 2025-05-07T20:32:35.7446193Z self, 2025-05-07T20:32:35.7446265Z T: int, 2025-05-07T20:32:35.7446340Z D: int, 2025-05-07T20:32:35.7446438Z scale_ub: Optional[float], 2025-05-07T20:32:35.7446527Z contiguous: bool, 2025-05-07T20:32:35.7446614Z compiled: bool, 2025-05-07T20:32:35.7446693Z ) -> None: 2025-05-07T20:32:35.7446788Z torch.manual_seed(2025) 2025-05-07T20:32:35.7446857Z 2025-05-07T20:32:35.7447033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7447103Z 2025-05-07T20:32:35.7447196Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7447317Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7447403Z x = x_sign * x_clamp 2025-05-07T20:32:35.7447529Z x0 = x[:, :D] 2025-05-07T20:32:35.7447608Z x1 = x[:, D:] 2025-05-07T20:32:35.7447683Z 2025-05-07T20:32:35.7447766Z if contiguous: 2025-05-07T20:32:35.7447859Z x0 = x0.contiguous() 2025-05-07T20:32:35.7447951Z x1 = x1.contiguous() 2025-05-07T20:32:35.7448021Z 2025-05-07T20:32:35.7448110Z if scale_ub is not None: 2025-05-07T20:32:35.7448217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7448350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7448427Z ) 2025-05-07T20:32:35.7448504Z else: 2025-05-07T20:32:35.7448598Z scale_ub_tensor = None 2025-05-07T20:32:35.7448670Z 2025-05-07T20:32:35.7448805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7448892Z op = silu_mul_quant 2025-05-07T20:32:35.7448978Z if compiled: 2025-05-07T20:32:35.7449075Z op = torch.compile(op) 2025-05-07T20:32:35.7449187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7449257Z 2025-05-07T20:32:35.7449348Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7449468Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7449540Z 2025-05-07T20:32:35.7449674Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7449777Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7449880Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7450001Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7450141Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7450220Z 2025-05-07T20:32:35.7450321Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.7450325Z 2025-05-07T20:32:35.7450425Z moe/activation_test.py:126: 2025-05-07T20:32:35.7450550Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7450711Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7450854Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7451411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7460896Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7461289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7461515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7462012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7462272Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7462675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7462933Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7463318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7463484Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7463821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7463900Z fn() 2025-05-07T20:32:35.7464298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7464392Z self.fn.run( 2025-05-07T20:32:35.7464731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7464827Z kernel = self.compile( 2025-05-07T20:32:35.7465264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7465443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7465570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7465575Z 2025-05-07T20:32:35.7465790Z self = 2025-05-07T20:32:35.7466563Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7467071Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd097c271c0>} 2025-05-07T20:32:35.7467826Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7468020Z context = 2025-05-07T20:32:35.7468025Z 2025-05-07T20:32:35.7468189Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7468451Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7468567Z module_map=module_map) 2025-05-07T20:32:35.7468731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7468836Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7468917Z E ^ 2025-05-07T20:32:35.7469270Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7469275Z 2025-05-07T20:32:35.7469691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7469739Z 2025-05-07T20:32:35.7469846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7470065Z self=, 2025-05-07T20:32:35.7470148Z T=2048, 2025-05-07T20:32:35.7470224Z D=5120, 2025-05-07T20:32:35.7470317Z scale_ub=None, 2025-05-07T20:32:35.7470405Z contiguous=True, 2025-05-07T20:32:35.7470488Z compiled=True, 2025-05-07T20:32:35.7470565Z ) 2025-05-07T20:32:35.7470782Z self = 2025-05-07T20:32:35.7471020Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7471062Z 2025-05-07T20:32:35.7471146Z @given( 2025-05-07T20:32:35.7471266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7471367Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7471490Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7471616Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7471737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7471814Z ) 2025-05-07T20:32:35.7472056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7472155Z def test_silu_mul_quant( 2025-05-07T20:32:35.7472234Z self, 2025-05-07T20:32:35.7472310Z T: int, 2025-05-07T20:32:35.7472392Z D: int, 2025-05-07T20:32:35.7472492Z scale_ub: Optional[float], 2025-05-07T20:32:35.7472586Z contiguous: bool, 2025-05-07T20:32:35.7472675Z compiled: bool, 2025-05-07T20:32:35.7472753Z ) -> None: 2025-05-07T20:32:35.7472856Z torch.manual_seed(2025) 2025-05-07T20:32:35.7472937Z 2025-05-07T20:32:35.7473103Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7473178Z 2025-05-07T20:32:35.7473273Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7473445Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7473540Z x = x_sign * x_clamp 2025-05-07T20:32:35.7473621Z x0 = x[:, :D] 2025-05-07T20:32:35.7473703Z x1 = x[:, D:] 2025-05-07T20:32:35.7473788Z 2025-05-07T20:32:35.7473871Z if contiguous: 2025-05-07T20:32:35.7473965Z x0 = x0.contiguous() 2025-05-07T20:32:35.7474062Z x1 = x1.contiguous() 2025-05-07T20:32:35.7474132Z 2025-05-07T20:32:35.7474224Z if scale_ub is not None: 2025-05-07T20:32:35.7474337Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7474474Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7474550Z ) 2025-05-07T20:32:35.7474634Z else: 2025-05-07T20:32:35.7474727Z scale_ub_tensor = None 2025-05-07T20:32:35.7474803Z 2025-05-07T20:32:35.7474935Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7475027Z op = silu_mul_quant 2025-05-07T20:32:35.7475121Z if compiled: 
2025-05-07T20:32:35.7475226Z                 op = torch.compile(op)
2025-05-07T20:32:35.7475331Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.7475496Z         y_fp8, y_scale = fn()
2025-05-07T20:32:35.7475617Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:35.7475825Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:35.7475927Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:35.7476036Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:35.7476158Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:35.7476304Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:35.7476477Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:35.7476584Z moe/activation_test.py:126:
2025-05-07T20:32:35.7476761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7476866Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:35.7477006Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:35.7477570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:35.7477675Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:35.7478029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.7478335Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.7478713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:35.7478968Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:35.7479381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:35.7479632Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:35.7480002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:35.7480174Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:35.7480511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:35.7480593Z     fn()
2025-05-07T20:32:35.7480997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:35.7481076Z     self.fn.run(
2025-05-07T20:32:35.7481416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.7481553Z     kernel = self.compile(
2025-05-07T20:32:35.7481933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.7482115Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.7482240Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7482449Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:35.7483224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.7483729Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd0977cf9a0>}
2025-05-07T20:32:35.7484471Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:35.7484661Z context = <...>
2025-05-07T20:32:35.7484835Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.7485105Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.7485213Z                            module_map=module_map)
2025-05-07T20:32:35.7485387Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.7485489Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:35.7485564Z E       ^
2025-05-07T20:32:35.7485919Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.7486386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7486504Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:35.7493287Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:35.7493487Z moe/activation_test.py:126:
2025-05-07T20:32:35.7502307Z E       triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row(
2025-05-07T20:32:35.7502839Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
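Note: every example in this run fails the same way, whether the error surfaces in the eager FBGEMM kernel (_fbgemm_silu_mul_quant, via moe/activation_test.py:117) or in the reference quantization path (_kernel_quantize_fp8_row, via moe/activation_test.py:126). Triton's NVIDIA backend generally supports fp8e4nv (FP8 E4M3) only on GPUs with compute capability 8.9 or newer; older architectures expose only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a guard such a test could use, assuming only that torch is importable (the helper name supports_fp8e4nv is hypothetical, not part of the test file):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) codegen needs an SM 8.9+ GPU (Ada/Hopper);
        # earlier parts only expose fp8e4b15 and fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch on the test class or method:
    # @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")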
2025-05-07T20:32:35.7491885Z op = silu_mul_quant 2025-05-07T20:32:35.7491973Z if compiled: 2025-05-07T20:32:35.7492070Z op = torch.compile(op) 2025-05-07T20:32:35.7492177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7492245Z 2025-05-07T20:32:35.7492333Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7492453Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7492522Z 2025-05-07T20:32:35.7492656Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7492760Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7492857Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7492977Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7493124Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7493191Z 2025-05-07T20:32:35.7493287Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.7493387Z 2025-05-07T20:32:35.7493487Z moe/activation_test.py:126: 2025-05-07T20:32:35.7493619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7493724Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7493855Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7494410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7494580Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7494985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7495211Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7495574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7495830Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7496231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7496479Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7496845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7497010Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7497357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7497431Z fn() 2025-05-07T20:32:35.7497830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7497909Z self.fn.run( 2025-05-07T20:32:35.7498312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7498406Z kernel = self.compile( 2025-05-07T20:32:35.7498781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7498955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7499080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7499085Z 2025-05-07T20:32:35.7499293Z self = 2025-05-07T20:32:35.7500167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7500675Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09694c700>} 2025-05-07T20:32:35.7501411Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7501598Z context = 2025-05-07T20:32:35.7501603Z 2025-05-07T20:32:35.7501768Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7502033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7502145Z module_map=module_map) 2025-05-07T20:32:35.7502307Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7502409Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7502488Z E ^ 2025-05-07T20:32:35.7502839Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7502916Z 2025-05-07T20:32:35.7503327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7503334Z 2025-05-07T20:32:35.7503433Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7503647Z self=, 2025-05-07T20:32:35.7503722Z T=4096, 2025-05-07T20:32:35.7503794Z D=5120, 2025-05-07T20:32:35.7503914Z scale_ub=None, 2025-05-07T20:32:35.7504001Z contiguous=True, 2025-05-07T20:32:35.7504078Z compiled=True, 2025-05-07T20:32:35.7504148Z ) 2025-05-07T20:32:35.7504405Z self = 2025-05-07T20:32:35.7504574Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7504579Z 2025-05-07T20:32:35.7504652Z @given( 2025-05-07T20:32:35.7504772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7504869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7504988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7505101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7505214Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7505290Z ) 2025-05-07T20:32:35.7505533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7505623Z def test_silu_mul_quant( 2025-05-07T20:32:35.7505709Z self, 2025-05-07T20:32:35.7505783Z T: int, 2025-05-07T20:32:35.7505860Z D: int, 2025-05-07T20:32:35.7505964Z scale_ub: Optional[float], 2025-05-07T20:32:35.7506050Z contiguous: bool, 2025-05-07T20:32:35.7506137Z compiled: bool, 2025-05-07T20:32:35.7506211Z ) -> None: 2025-05-07T20:32:35.7506304Z torch.manual_seed(2025) 2025-05-07T20:32:35.7506379Z 2025-05-07T20:32:35.7506593Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7506664Z 2025-05-07T20:32:35.7506753Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7506877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7506962Z x = x_sign * x_clamp 2025-05-07T20:32:35.7507041Z x0 = x[:, :D] 2025-05-07T20:32:35.7507115Z x1 = x[:, D:] 2025-05-07T20:32:35.7507187Z 2025-05-07T20:32:35.7507279Z if contiguous: 2025-05-07T20:32:35.7507366Z x0 = x0.contiguous() 2025-05-07T20:32:35.7507459Z x1 = x1.contiguous() 2025-05-07T20:32:35.7507527Z 2025-05-07T20:32:35.7507619Z if scale_ub is not None: 2025-05-07T20:32:35.7507726Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7507857Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7507929Z ) 2025-05-07T20:32:35.7508006Z else: 2025-05-07T20:32:35.7508107Z scale_ub_tensor 
= None 2025-05-07T20:32:35.7508177Z 2025-05-07T20:32:35.7508311Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7508399Z op = silu_mul_quant 2025-05-07T20:32:35.7508482Z if compiled: 2025-05-07T20:32:35.7508583Z op = torch.compile(op) 2025-05-07T20:32:35.7508688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7508757Z 2025-05-07T20:32:35.7508847Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7508967Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7509041Z 2025-05-07T20:32:35.7509178Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7509280Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7509382Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7509503Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7509641Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7509766Z 2025-05-07T20:32:35.7509866Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.7509870Z 2025-05-07T20:32:35.7509969Z moe/activation_test.py:126: 2025-05-07T20:32:35.7510093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7510196Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7510334Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7510883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7511022Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7511415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7511638Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7512019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7512267Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7512664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7512916Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7513289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7513463Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7513798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7513869Z fn() 2025-05-07T20:32:35.7514266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7514393Z self.fn.run( 2025-05-07T20:32:35.7514731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7514828Z kernel = self.compile( 2025-05-07T20:32:35.7515198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7515373Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7515499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7515507Z 2025-05-07T20:32:35.7515713Z self = 2025-05-07T20:32:35.7516480Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7516976Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096894280>} 2025-05-07T20:32:35.7517709Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7517894Z context = 2025-05-07T20:32:35.7517902Z 2025-05-07T20:32:35.7518062Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7518327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7518430Z module_map=module_map) 2025-05-07T20:32:35.7518596Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7518695Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7518813Z E ^ 2025-05-07T20:32:35.7519167Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7519173Z 2025-05-07T20:32:35.7519580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7519585Z 2025-05-07T20:32:35.7519689Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7519905Z self=, 2025-05-07T20:32:35.7520020Z T=16384, 2025-05-07T20:32:35.7520096Z D=5120, 2025-05-07T20:32:35.7520177Z scale_ub=None, 2025-05-07T20:32:35.7520303Z contiguous=True, 2025-05-07T20:32:35.7520386Z compiled=True, 2025-05-07T20:32:35.7520455Z ) 2025-05-07T20:32:35.7520667Z self = 2025-05-07T20:32:35.7520848Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7520858Z 2025-05-07T20:32:35.7520931Z @given( 2025-05-07T20:32:35.7521058Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7521172Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7521304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7521431Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7521545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7521615Z ) 2025-05-07T20:32:35.7521866Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7521961Z def test_silu_mul_quant( 2025-05-07T20:32:35.7522036Z self, 2025-05-07T20:32:35.7522112Z T: int, 2025-05-07T20:32:35.7522183Z D: int, 2025-05-07T20:32:35.7522280Z scale_ub: Optional[float], 2025-05-07T20:32:35.7522370Z contiguous: bool, 2025-05-07T20:32:35.7522454Z compiled: bool, 2025-05-07T20:32:35.7522538Z ) -> None: 2025-05-07T20:32:35.7522673Z torch.manual_seed(2025) 2025-05-07T20:32:35.7522745Z 2025-05-07T20:32:35.7522916Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7522983Z 2025-05-07T20:32:35.7523072Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7523203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7523287Z x = x_sign * x_clamp 2025-05-07T20:32:35.7523361Z x0 = x[:, :D] 2025-05-07T20:32:35.7523445Z x1 = x[:, D:] 2025-05-07T20:32:35.7523518Z 2025-05-07T20:32:35.7523599Z if contiguous: 2025-05-07T20:32:35.7523696Z x0 = x0.contiguous() 2025-05-07T20:32:35.7523785Z x1 = x1.contiguous() 2025-05-07T20:32:35.7523857Z 2025-05-07T20:32:35.7523950Z if scale_ub is not None: 2025-05-07T20:32:35.7524052Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7524195Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:35.7524271Z ) 2025-05-07T20:32:35.7524348Z else: 2025-05-07T20:32:35.7524446Z scale_ub_tensor = None 2025-05-07T20:32:35.7524513Z 2025-05-07T20:32:35.7524646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7524734Z op = silu_mul_quant 2025-05-07T20:32:35.7524819Z if compiled: 2025-05-07T20:32:35.7524922Z op = torch.compile(op) 2025-05-07T20:32:35.7525027Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7525100Z 2025-05-07T20:32:35.7525200Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7525317Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7525395Z 2025-05-07T20:32:35.7525536Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7525634Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7525735Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7525907Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7526044Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7526122Z 2025-05-07T20:32:35.7526219Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.7526224Z 2025-05-07T20:32:35.7526321Z moe/activation_test.py:126: 2025-05-07T20:32:35.7526452Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7526554Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7526684Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7527314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7527415Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7527779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7528003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7528370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7528624Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7529012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7529263Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7529641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7529809Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7530154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7530232Z fn() 2025-05-07T20:32:35.7530689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7530772Z self.fn.run( 2025-05-07T20:32:35.7531103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7531202Z kernel = self.compile( 2025-05-07T20:32:35.7531575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7531745Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7531879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:35.7531883Z 2025-05-07T20:32:35.7532082Z self = 2025-05-07T20:32:35.7532849Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7533353Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096894a60>} 2025-05-07T20:32:35.7534086Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7534280Z context = 2025-05-07T20:32:35.7534285Z 2025-05-07T20:32:35.7534449Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7534716Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7534820Z module_map=module_map) 2025-05-07T20:32:35.7535026Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7535131Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7535205Z E ^ 2025-05-07T20:32:35.7535558Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7535562Z 2025-05-07T20:32:35.7535966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7535971Z 2025-05-07T20:32:35.7536070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7536331Z self=, 2025-05-07T20:32:35.7536441Z T=1, 2025-05-07T20:32:35.7536514Z D=5120, 2025-05-07T20:32:35.7536598Z scale_ub=1200.0, 2025-05-07T20:32:35.7536682Z contiguous=True, 2025-05-07T20:32:35.7536764Z compiled=True, 2025-05-07T20:32:35.7536835Z ) 2025-05-07T20:32:35.7537047Z self = 2025-05-07T20:32:35.7537215Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.7537219Z 2025-05-07T20:32:35.7537293Z @given( 2025-05-07T20:32:35.7537407Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7537507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7537621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7537735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7537847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7537920Z ) 2025-05-07T20:32:35.7538168Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7538261Z def test_silu_mul_quant( 2025-05-07T20:32:35.7538336Z self, 2025-05-07T20:32:35.7538410Z T: int, 2025-05-07T20:32:35.7538481Z D: int, 2025-05-07T20:32:35.7538577Z scale_ub: Optional[float], 2025-05-07T20:32:35.7538714Z contiguous: bool, 2025-05-07T20:32:35.7538799Z compiled: bool, 2025-05-07T20:32:35.7538876Z ) -> None: 2025-05-07T20:32:35.7538973Z torch.manual_seed(2025) 2025-05-07T20:32:35.7539046Z 2025-05-07T20:32:35.7539212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7539290Z 2025-05-07T20:32:35.7539383Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7539510Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7539598Z x = x_sign * x_clamp 2025-05-07T20:32:35.7539678Z x0 = x[:, :D] 2025-05-07T20:32:35.7539912Z x1 = x[:, D:] 2025-05-07T20:32:35.7539983Z 2025-05-07T20:32:35.7540068Z if contiguous: 2025-05-07T20:32:35.7540166Z x0 = x0.contiguous() 2025-05-07T20:32:35.7540253Z x1 = x1.contiguous() 2025-05-07T20:32:35.7540320Z 2025-05-07T20:32:35.7540414Z if scale_ub is not None: 2025-05-07T20:32:35.7540520Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:35.7540659Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7540738Z ) 2025-05-07T20:32:35.7540811Z else: 2025-05-07T20:32:35.7540903Z scale_ub_tensor = None 2025-05-07T20:32:35.7540977Z 2025-05-07T20:32:35.7541103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7541194Z op = silu_mul_quant 2025-05-07T20:32:35.7541278Z if compiled: 2025-05-07T20:32:35.7541375Z op = torch.compile(op) 2025-05-07T20:32:35.7541487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7541557Z 2025-05-07T20:32:35.7541649Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7541654Z 2025-05-07T20:32:35.7541757Z moe/activation_test.py:117: 2025-05-07T20:32:35.7541885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7541985Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7542141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7542501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7542596Z return fn(*args, **kwargs) 2025-05-07T20:32:35.7543084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7543180Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7543535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7543835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7544180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7544274Z kernel = self.compile( 2025-05-07T20:32:35.7544658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7544839Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7544962Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7544966Z 2025-05-07T20:32:35.7545168Z self = 2025-05-07T20:32:35.7545933Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7546440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09688f1c0>} 2025-05-07T20:32:35.7547214Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7547405Z context = 2025-05-07T20:32:35.7547410Z 2025-05-07T20:32:35.7547572Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7547828Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7547932Z module_map=module_map) 2025-05-07T20:32:35.7548296Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7548396Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7548469Z E ^ 2025-05-07T20:32:35.7548823Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7548828Z 2025-05-07T20:32:35.7549240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7549250Z 2025-05-07T20:32:35.7549356Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7549574Z self=, 2025-05-07T20:32:35.7549645Z T=1, 2025-05-07T20:32:35.7549722Z D=5120, 2025-05-07T20:32:35.7549801Z scale_ub=None, 2025-05-07T20:32:35.7549882Z contiguous=False, 2025-05-07T20:32:35.7549967Z compiled=True, 2025-05-07T20:32:35.7550039Z ) 2025-05-07T20:32:35.7550254Z self = 2025-05-07T20:32:35.7550417Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7550421Z 2025-05-07T20:32:35.7550496Z @given( 2025-05-07T20:32:35.7550616Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7550714Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7550829Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7551001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7551114Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7551181Z ) 2025-05-07T20:32:35.7551427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7551516Z def test_silu_mul_quant( 2025-05-07T20:32:35.7551590Z self, 2025-05-07T20:32:35.7551662Z T: int, 2025-05-07T20:32:35.7551736Z D: int, 2025-05-07T20:32:35.7551839Z scale_ub: Optional[float], 2025-05-07T20:32:35.7551926Z contiguous: bool, 2025-05-07T20:32:35.7552051Z compiled: bool, 2025-05-07T20:32:35.7552129Z ) -> None: 2025-05-07T20:32:35.7552258Z torch.manual_seed(2025) 2025-05-07T20:32:35.7552324Z 2025-05-07T20:32:35.7552496Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7552564Z 2025-05-07T20:32:35.7552652Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7552778Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7552868Z x = x_sign * x_clamp 2025-05-07T20:32:35.7552951Z x0 = x[:, :D] 2025-05-07T20:32:35.7553026Z x1 = x[:, D:] 2025-05-07T20:32:35.7553096Z 2025-05-07T20:32:35.7553179Z if contiguous: 2025-05-07T20:32:35.7553269Z x0 = x0.contiguous() 2025-05-07T20:32:35.7553353Z x1 = x1.contiguous() 2025-05-07T20:32:35.7553426Z 2025-05-07T20:32:35.7553510Z if scale_ub is not None: 2025-05-07T20:32:35.7553611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7553751Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7553825Z ) 2025-05-07T20:32:35.7553900Z else: 2025-05-07T20:32:35.7554004Z scale_ub_tensor = None 2025-05-07T20:32:35.7554074Z 2025-05-07T20:32:35.7554201Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7554291Z op = silu_mul_quant 2025-05-07T20:32:35.7554376Z if compiled: 2025-05-07T20:32:35.7554521Z op = torch.compile(op) 2025-05-07T20:32:35.7554626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7554697Z 2025-05-07T20:32:35.7554789Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7554908Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7554980Z 2025-05-07T20:32:35.7555119Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7555217Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7555316Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7555446Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7555586Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7555659Z 2025-05-07T20:32:35.7555758Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.7555762Z 2025-05-07T20:32:35.7555856Z moe/activation_test.py:126: 2025-05-07T20:32:35.7555989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7556089Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7556220Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7556769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7556865Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7557220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7557443Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7557804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7558057Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7558498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7558748Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7559116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7559277Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7559621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7559736Z fn() 2025-05-07T20:32:35.7560194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7560277Z self.fn.run( 2025-05-07T20:32:35.7560616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7560716Z kernel = self.compile( 2025-05-07T20:32:35.7561144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7561327Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7561454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7561458Z 2025-05-07T20:32:35.7561661Z self = 2025-05-07T20:32:35.7562426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7562922Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd09688e680>} 2025-05-07T20:32:35.7563697Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7563886Z context = 2025-05-07T20:32:35.7563891Z 2025-05-07T20:32:35.7564053Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7564313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7564421Z module_map=module_map) 2025-05-07T20:32:35.7564579Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7564683Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7564756Z E ^ 2025-05-07T20:32:35.7565103Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7565112Z 2025-05-07T20:32:35.7565529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7565534Z 2025-05-07T20:32:35.7565634Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7565854Z self=, 2025-05-07T20:32:35.7565928Z T=1, 2025-05-07T20:32:35.7566000Z D=5120, 2025-05-07T20:32:35.7566080Z scale_ub=None, 2025-05-07T20:32:35.7566162Z contiguous=True, 2025-05-07T20:32:35.7566241Z compiled=False, 2025-05-07T20:32:35.7566316Z ) 2025-05-07T20:32:35.7566528Z self = 2025-05-07T20:32:35.7566698Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.7566703Z 2025-05-07T20:32:35.7566778Z @given( 2025-05-07T20:32:35.7566894Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7566994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7567156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7567271Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7567389Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7567460Z ) 2025-05-07T20:32:35.7567701Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7567792Z def test_silu_mul_quant( 2025-05-07T20:32:35.7567864Z self, 2025-05-07T20:32:35.7567944Z T: int, 2025-05-07T20:32:35.7568015Z D: int, 2025-05-07T20:32:35.7568157Z scale_ub: Optional[float], 2025-05-07T20:32:35.7568245Z contiguous: bool, 2025-05-07T20:32:35.7568365Z compiled: bool, 2025-05-07T20:32:35.7568442Z ) -> None: 2025-05-07T20:32:35.7568540Z torch.manual_seed(2025) 2025-05-07T20:32:35.7568608Z 2025-05-07T20:32:35.7568774Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7568853Z 2025-05-07T20:32:35.7568945Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7569067Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7569156Z x = x_sign * x_clamp 2025-05-07T20:32:35.7569233Z x0 = x[:, :D] 2025-05-07T20:32:35.7569312Z x1 = x[:, D:] 2025-05-07T20:32:35.7569383Z 2025-05-07T20:32:35.7569465Z if contiguous: 2025-05-07T20:32:35.7569560Z x0 = x0.contiguous() 2025-05-07T20:32:35.7569645Z x1 = x1.contiguous() 2025-05-07T20:32:35.7569717Z 2025-05-07T20:32:35.7569813Z if scale_ub is not None: 2025-05-07T20:32:35.7569915Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7570050Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7570125Z ) 2025-05-07T20:32:35.7570199Z else: 2025-05-07T20:32:35.7570294Z scale_ub_tensor = None 2025-05-07T20:32:35.7570364Z 2025-05-07T20:32:35.7570490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7570628Z op = silu_mul_quant 2025-05-07T20:32:35.7570723Z if compiled: 2025-05-07T20:32:35.7570820Z 
op = torch.compile(op) 2025-05-07T20:32:35.7570927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7570994Z 2025-05-07T20:32:35.7571085Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7571090Z 2025-05-07T20:32:35.7571188Z moe/activation_test.py:117: 2025-05-07T20:32:35.7571313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7571414Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7571513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7572006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7572104Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7572458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7572676Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7573010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7573100Z kernel = self.compile( 2025-05-07T20:32:35.7573476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7573651Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7573776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7573782Z 2025-05-07T20:32:35.7573987Z self = 2025-05-07T20:32:35.7574749Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7575300Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbd900>} 2025-05-07T20:32:35.7576041Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7576271Z context = 2025-05-07T20:32:35.7576275Z 2025-05-07T20:32:35.7576478Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7576741Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7576846Z module_map=module_map) 2025-05-07T20:32:35.7577005Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7577107Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7577184Z E ^ 2025-05-07T20:32:35.7577533Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7577537Z 2025-05-07T20:32:35.7577948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7577953Z 2025-05-07T20:32:35.7578060Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7578280Z self=, 2025-05-07T20:32:35.7578355Z T=128, 2025-05-07T20:32:35.7578431Z D=5120, 2025-05-07T20:32:35.7578508Z scale_ub=None, 2025-05-07T20:32:35.7578593Z contiguous=False, 2025-05-07T20:32:35.7578669Z compiled=True, 2025-05-07T20:32:35.7578737Z ) 2025-05-07T20:32:35.7578953Z self = 2025-05-07T20:32:35.7579165Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7579170Z 2025-05-07T20:32:35.7579243Z @given( 2025-05-07T20:32:35.7579364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7579459Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7579580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7579692Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7579916Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7579997Z ) 2025-05-07T20:32:35.7580238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7580331Z def test_silu_mul_quant( 2025-05-07T20:32:35.7580413Z self, 2025-05-07T20:32:35.7580483Z T: int, 2025-05-07T20:32:35.7580555Z D: int, 2025-05-07T20:32:35.7580655Z scale_ub: Optional[float], 2025-05-07T20:32:35.7580743Z contiguous: bool, 2025-05-07T20:32:35.7580829Z compiled: bool, 2025-05-07T20:32:35.7580910Z ) -> None: 2025-05-07T20:32:35.7581002Z torch.manual_seed(2025) 2025-05-07T20:32:35.7581075Z 2025-05-07T20:32:35.7581242Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7581311Z 2025-05-07T20:32:35.7581402Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7581526Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7581614Z x = x_sign * x_clamp 2025-05-07T20:32:35.7581701Z x0 = x[:, :D] 2025-05-07T20:32:35.7581774Z x1 = x[:, D:] 2025-05-07T20:32:35.7581839Z 2025-05-07T20:32:35.7581922Z if contiguous: 2025-05-07T20:32:35.7582013Z x0 = x0.contiguous() 2025-05-07T20:32:35.7582101Z x1 = x1.contiguous() 2025-05-07T20:32:35.7582175Z 2025-05-07T20:32:35.7582263Z if scale_ub is not None: 2025-05-07T20:32:35.7582370Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7582631Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7582702Z ) 2025-05-07T20:32:35.7582780Z else: 2025-05-07T20:32:35.7582874Z scale_ub_tensor = None 2025-05-07T20:32:35.7582942Z 2025-05-07T20:32:35.7583075Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7583164Z op = silu_mul_quant 2025-05-07T20:32:35.7583248Z if compiled: 2025-05-07T20:32:35.7583350Z op = torch.compile(op) 2025-05-07T20:32:35.7583497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7583563Z 2025-05-07T20:32:35.7583655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7583696Z 2025-05-07T20:32:35.7583794Z moe/activation_test.py:117: 2025-05-07T20:32:35.7583924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7584026Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7587545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7587943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7588038Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7588529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7588626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7588984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7589209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7589550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7589649Z kernel = self.compile( 2025-05-07T20:32:35.7590298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7590588Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7590721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7590726Z 2025-05-07T20:32:35.7590934Z self = 2025-05-07T20:32:35.7591698Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7592200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbfeb0>} 2025-05-07T20:32:35.7592942Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7593138Z context = 2025-05-07T20:32:35.7593142Z 2025-05-07T20:32:35.7593308Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7593575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7593681Z module_map=module_map) 2025-05-07T20:32:35.7593839Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7593944Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7594016Z E ^ 2025-05-07T20:32:35.7594367Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7594376Z 2025-05-07T20:32:35.7594784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7594851Z 2025-05-07T20:32:35.7594960Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7595184Z self=, 2025-05-07T20:32:35.7595255Z T=128, 2025-05-07T20:32:35.7595326Z D=7168, 2025-05-07T20:32:35.7595409Z scale_ub=1200.0, 2025-05-07T20:32:35.7595492Z contiguous=False, 2025-05-07T20:32:35.7595574Z compiled=False, 2025-05-07T20:32:35.7595648Z ) 2025-05-07T20:32:35.7595861Z self = 2025-05-07T20:32:35.7596127Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.7596132Z 2025-05-07T20:32:35.7596267Z @given( 2025-05-07T20:32:35.7596389Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7596489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7596604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7596719Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7596841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7596914Z ) 2025-05-07T20:32:35.7597162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7597253Z def test_silu_mul_quant( 2025-05-07T20:32:35.7597328Z self, 2025-05-07T20:32:35.7597409Z T: int, 2025-05-07T20:32:35.7597484Z D: int, 2025-05-07T20:32:35.7597582Z scale_ub: Optional[float], 2025-05-07T20:32:35.7597672Z contiguous: bool, 2025-05-07T20:32:35.7597757Z compiled: bool, 2025-05-07T20:32:35.7597834Z ) -> None: 2025-05-07T20:32:35.7597933Z torch.manual_seed(2025) 2025-05-07T20:32:35.7598006Z 2025-05-07T20:32:35.7598174Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7598253Z 2025-05-07T20:32:35.7598346Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7598468Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7598619Z x = x_sign * x_clamp 2025-05-07T20:32:35.7598703Z x0 = x[:, :D] 2025-05-07T20:32:35.7598786Z x1 = x[:, D:] 2025-05-07T20:32:35.7598858Z 2025-05-07T20:32:35.7598940Z if contiguous: 2025-05-07T20:32:35.7599037Z x0 = x0.contiguous() 2025-05-07T20:32:35.7599127Z x1 = x1.contiguous() 2025-05-07T20:32:35.7599199Z 2025-05-07T20:32:35.7599296Z if scale_ub is not None: 2025-05-07T20:32:35.7599406Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7599545Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7599628Z ) 2025-05-07T20:32:35.7599707Z else: 2025-05-07T20:32:35.7599802Z scale_ub_tensor = None 2025-05-07T20:32:35.7599880Z 2025-05-07T20:32:35.7600012Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7600102Z op = silu_mul_quant 2025-05-07T20:32:35.7600189Z if compiled: 2025-05-07T20:32:35.7600297Z op = torch.compile(op) 2025-05-07T20:32:35.7600408Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7600478Z 2025-05-07T20:32:35.7600570Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7600574Z 2025-05-07T20:32:35.7600676Z moe/activation_test.py:117: 2025-05-07T20:32:35.7600803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7600901Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7601002Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7601509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7601611Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7601965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7602186Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7602580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7602674Z kernel = self.compile( 2025-05-07T20:32:35.7603051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7603226Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7603353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7603400Z 2025-05-07T20:32:35.7603610Z self = 2025-05-07T20:32:35.7604414Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7604929Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbd7e0>} 2025-05-07T20:32:35.7605669Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7605860Z context = 2025-05-07T20:32:35.7605865Z 2025-05-07T20:32:35.7606032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7606297Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7606410Z module_map=module_map) 2025-05-07T20:32:35.7606573Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7606671Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7606756Z E ^ 2025-05-07T20:32:35.7607149Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7607155Z 2025-05-07T20:32:35.7607573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7607583Z 2025-05-07T20:32:35.7607693Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7607911Z self=, 2025-05-07T20:32:35.7607996Z T=128, 2025-05-07T20:32:35.7608072Z D=5120, 2025-05-07T20:32:35.7608153Z scale_ub=None, 2025-05-07T20:32:35.7608241Z contiguous=False, 2025-05-07T20:32:35.7608327Z compiled=False, 2025-05-07T20:32:35.7608399Z ) 2025-05-07T20:32:35.7608615Z self = 2025-05-07T20:32:35.7608785Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.7608792Z 2025-05-07T20:32:35.7608873Z @given( 2025-05-07T20:32:35.7608994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7609093Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7609212Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7609335Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7609449Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7609528Z ) 2025-05-07T20:32:35.7609772Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7609867Z def test_silu_mul_quant( 2025-05-07T20:32:35.7609947Z self, 2025-05-07T20:32:35.7610025Z T: int, 2025-05-07T20:32:35.7610101Z D: int, 2025-05-07T20:32:35.7610204Z scale_ub: Optional[float], 2025-05-07T20:32:35.7610292Z contiguous: bool, 2025-05-07T20:32:35.7610385Z compiled: bool, 2025-05-07T20:32:35.7610466Z ) -> None: 2025-05-07T20:32:35.7610609Z torch.manual_seed(2025) 2025-05-07T20:32:35.7610684Z 2025-05-07T20:32:35.7610852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7610927Z 2025-05-07T20:32:35.7611026Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7611148Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7611235Z x = x_sign * x_clamp 2025-05-07T20:32:35.7611320Z x0 = x[:, :D] 2025-05-07T20:32:35.7611406Z x1 = x[:, D:] 2025-05-07T20:32:35.7611476Z 2025-05-07T20:32:35.7611606Z if contiguous: 2025-05-07T20:32:35.7611703Z x0 = x0.contiguous() 2025-05-07T20:32:35.7611789Z x1 = x1.contiguous() 2025-05-07T20:32:35.7611898Z 2025-05-07T20:32:35.7611995Z if scale_ub is not None: 2025-05-07T20:32:35.7612100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7612234Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7612315Z ) 2025-05-07T20:32:35.7612393Z else: 2025-05-07T20:32:35.7612490Z scale_ub_tensor = None 2025-05-07T20:32:35.7612560Z 2025-05-07T20:32:35.7612687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7612777Z op = silu_mul_quant 2025-05-07T20:32:35.7612861Z if compiled: 2025-05-07T20:32:35.7612960Z op = torch.compile(op) 2025-05-07T20:32:35.7613073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7613146Z 2025-05-07T20:32:35.7613236Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7613243Z 2025-05-07T20:32:35.7613341Z moe/activation_test.py:117: 2025-05-07T20:32:35.7613471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7613576Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7613679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7614216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7614322Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7614677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7614895Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7615237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7615331Z kernel = self.compile( 2025-05-07T20:32:35.7615715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7615892Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7616017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7616022Z 2025-05-07T20:32:35.7616231Z self = 2025-05-07T20:32:35.7617000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7617497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09696beb0>} 2025-05-07T20:32:35.7618236Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7618427Z context = 2025-05-07T20:32:35.7618437Z 2025-05-07T20:32:35.7618603Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7618862Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7619019Z module_map=module_map) 2025-05-07T20:32:35.7619178Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7619275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7619353Z E ^ 2025-05-07T20:32:35.7619704Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:35.7620492Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:35.7633122Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.7646198Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[Hypothesis re-prints the full test source and an identical traceback for each of these examples; the repeats are elided here. The last traceback ends with:]
2025-05-07T20:32:35.7658199Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7658299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7658378Z E ^ 2025-05-07T20:32:35.7658728Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7658733Z 2025-05-07T20:32:35.7659150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7659201Z 2025-05-07T20:32:35.7659305Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7659529Z self=, 2025-05-07T20:32:35.7659605Z T=1, 2025-05-07T20:32:35.7659680Z D=7168, 2025-05-07T20:32:35.7659898Z scale_ub=None, 2025-05-07T20:32:35.7659985Z contiguous=False, 2025-05-07T20:32:35.7660066Z compiled=True, 2025-05-07T20:32:35.7660139Z ) 2025-05-07T20:32:35.7660351Z self = 2025-05-07T20:32:35.7660560Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7660606Z 2025-05-07T20:32:35.7660683Z @given( 2025-05-07T20:32:35.7660803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7660905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7661019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7661142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7661259Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7661335Z ) 2025-05-07T20:32:35.7661581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7661677Z def test_silu_mul_quant( 2025-05-07T20:32:35.7661751Z self, 2025-05-07T20:32:35.7661825Z T: int, 2025-05-07T20:32:35.7661903Z D: int, 2025-05-07T20:32:35.7662002Z scale_ub: Optional[float], 2025-05-07T20:32:35.7662100Z contiguous: bool, 2025-05-07T20:32:35.7662186Z compiled: bool, 2025-05-07T20:32:35.7662264Z ) -> None: 2025-05-07T20:32:35.7662364Z torch.manual_seed(2025) 2025-05-07T20:32:35.7662437Z 2025-05-07T20:32:35.7662606Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7662684Z 2025-05-07T20:32:35.7662777Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7662947Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7663039Z x = x_sign * x_clamp 2025-05-07T20:32:35.7663119Z x0 = x[:, :D] 2025-05-07T20:32:35.7663197Z x1 = x[:, D:] 2025-05-07T20:32:35.7663277Z 2025-05-07T20:32:35.7663361Z if contiguous: 2025-05-07T20:32:35.7663454Z x0 = x0.contiguous() 2025-05-07T20:32:35.7663542Z x1 = x1.contiguous() 2025-05-07T20:32:35.7663616Z 2025-05-07T20:32:35.7663714Z if scale_ub is not None: 2025-05-07T20:32:35.7663823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7663957Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7664039Z ) 2025-05-07T20:32:35.7664117Z else: 2025-05-07T20:32:35.7664211Z scale_ub_tensor = None 2025-05-07T20:32:35.7664288Z 2025-05-07T20:32:35.7664417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7664504Z op = silu_mul_quant 2025-05-07T20:32:35.7664600Z if compiled: 2025-05-07T20:32:35.7664700Z op = torch.compile(op) 2025-05-07T20:32:35.7664814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7664886Z 2025-05-07T20:32:35.7664977Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7665102Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7665171Z 2025-05-07T20:32:35.7665306Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7665411Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7665514Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7665638Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7665783Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7665854Z 2025-05-07T20:32:35.7665954Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.7665964Z 2025-05-07T20:32:35.7666064Z moe/activation_test.py:126: 2025-05-07T20:32:35.7666243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7666353Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7666485Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7667040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7667145Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7667498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7667801Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7668170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7668423Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7668832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7669080Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7669455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7669621Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7669962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7670043Z fn() 2025-05-07T20:32:35.7670447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7670528Z self.fn.run( 2025-05-07T20:32:35.7670865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7671002Z kernel = self.compile( 2025-05-07T20:32:35.7671386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7671564Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7671689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7671693Z 2025-05-07T20:32:35.7671902Z self = 2025-05-07T20:32:35.7672680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7673187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0965a2ef0>} 2025-05-07T20:32:35.7673930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7674121Z context = 2025-05-07T20:32:35.7674125Z 2025-05-07T20:32:35.7674291Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7674551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7674662Z module_map=module_map) 2025-05-07T20:32:35.7674826Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7674925Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7675003Z E ^ 2025-05-07T20:32:35.7675353Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7675400Z 2025-05-07T20:32:35.7675816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7675823Z
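Unlike the other examples, this one fails a step later: fn() returned, and the CompilationError is raised from ref_fn instead, because triton_quantize_fp8_row JIT-compiles its own Triton kernel (_kernel_quantize_fp8_row) targeting the same unsupported fp8e4nv type. A hedged sketch of a Triton-free, pure-PyTorch stand-in for that reference quantization, assuming the usual row-wise absmax recipe; the function name is made up, and FBGEMM's actual kernel may differ in details such as clamping:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Row-wise absmax quantization: pick a per-row scale so the row's
        # largest magnitude maps to FP8_MAX, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_scaled = torch.clamp(y.float() / scale, min=-FP8_MAX, max=FP8_MAX)
        # scale is the per-row dequantization factor, matching the test's
        # y = y_fp8.to(torch.float32) * y_scale[:, None]
        return y_scaled.to(torch.float8_e4m3fn), scale.squeeze(-1)

On hardware where fp8e4nv compiles, the Triton kernel and a reference like this should agree up to fp8 rounding; on the A10G it would at least keep the reference path off the Triton JIT.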
2025-05-07T20:32:35.7675925Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7688910Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.7701834Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7718374Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7731359Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:35.7743763Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.7756176Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
[These examples again re-print the same test source and fail at y_fp8, y_scale = fn() with the identical CompilationError in _fbgemm_silu_mul_quant; the repeats are elided. The last traceback ends with:]
2025-05-07T20:32:35.7767968Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7768071Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7768146Z E ^ 2025-05-07T20:32:35.7768498Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7768503Z 2025-05-07T20:32:35.7768915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7768920Z 2025-05-07T20:32:35.7769019Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7769239Z self=, 2025-05-07T20:32:35.7769312Z T=4096, 2025-05-07T20:32:35.7769389Z D=5120, 2025-05-07T20:32:35.7769470Z scale_ub=None, 2025-05-07T20:32:35.7769553Z contiguous=False, 2025-05-07T20:32:35.7769635Z compiled=True, 2025-05-07T20:32:35.7769706Z ) 2025-05-07T20:32:35.7769918Z self = 2025-05-07T20:32:35.7770090Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7770098Z 2025-05-07T20:32:35.7770213Z @given( 2025-05-07T20:32:35.7770330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7770433Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7770545Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7770659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7770772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7770843Z ) 2025-05-07T20:32:35.7771089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7771182Z def test_silu_mul_quant( 2025-05-07T20:32:35.7771252Z self, 2025-05-07T20:32:35.7771330Z T: int, 2025-05-07T20:32:35.7771409Z D: int, 2025-05-07T20:32:35.7771512Z scale_ub: Optional[float], 2025-05-07T20:32:35.7771598Z contiguous: bool, 2025-05-07T20:32:35.7771681Z compiled: bool, 2025-05-07T20:32:35.7771764Z ) -> None: 2025-05-07T20:32:35.7771858Z torch.manual_seed(2025) 2025-05-07T20:32:35.7771933Z 2025-05-07T20:32:35.7772101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7772169Z 2025-05-07T20:32:35.7772260Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7772388Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7772475Z x = x_sign * x_clamp 2025-05-07T20:32:35.7772553Z x0 = x[:, :D] 2025-05-07T20:32:35.7772633Z x1 = x[:, D:] 2025-05-07T20:32:35.7772699Z 2025-05-07T20:32:35.7772784Z if contiguous: 2025-05-07T20:32:35.7772875Z x0 = x0.contiguous() 2025-05-07T20:32:35.7772962Z x1 = x1.contiguous() 2025-05-07T20:32:35.7773032Z 2025-05-07T20:32:35.7773119Z if scale_ub is not None: 2025-05-07T20:32:35.7773221Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7773353Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7773425Z ) 2025-05-07T20:32:35.7773548Z else: 2025-05-07T20:32:35.7773643Z scale_ub_tensor = None 2025-05-07T20:32:35.7773713Z 2025-05-07T20:32:35.7773841Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7773928Z op = silu_mul_quant 2025-05-07T20:32:35.7774010Z if compiled: 2025-05-07T20:32:35.7774104Z op = torch.compile(op) 2025-05-07T20:32:35.7774212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7774281Z 2025-05-07T20:32:35.7774374Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7774421Z 2025-05-07T20:32:35.7774517Z moe/activation_test.py:117: 2025-05-07T20:32:35.7774681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7774785Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7774880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7775239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7775338Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7775830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7775924Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7776274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7776491Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7776835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7776927Z kernel = self.compile( 2025-05-07T20:32:35.7777310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7777490Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7777680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7777685Z 2025-05-07T20:32:35.7777892Z self = 2025-05-07T20:32:35.7778655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7779150Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f0280>} 2025-05-07T20:32:35.7780007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7780194Z context = 2025-05-07T20:32:35.7780203Z 2025-05-07T20:32:35.7780368Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7780629Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7780738Z module_map=module_map) 2025-05-07T20:32:35.7780896Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7781017Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7781095Z E ^ 2025-05-07T20:32:35.7781464Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7781471Z 2025-05-07T20:32:35.7781883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7781890Z 2025-05-07T20:32:35.7781992Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7782210Z self=, 2025-05-07T20:32:35.7782331Z T=4096, 2025-05-07T20:32:35.7782402Z D=5120, 2025-05-07T20:32:35.7782482Z scale_ub=1200.0, 2025-05-07T20:32:35.7782565Z contiguous=False, 2025-05-07T20:32:35.7782645Z compiled=False, 2025-05-07T20:32:35.7782713Z ) 2025-05-07T20:32:35.7782926Z self = 2025-05-07T20:32:35.7783096Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.7783100Z 2025-05-07T20:32:35.7783216Z @given( 2025-05-07T20:32:35.7783331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7783467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7783585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7783702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7783812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7783888Z ) 2025-05-07T20:32:35.7784130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7784220Z def test_silu_mul_quant( 2025-05-07T20:32:35.7784300Z self, 2025-05-07T20:32:35.7784370Z T: int, 2025-05-07T20:32:35.7784440Z D: int, 2025-05-07T20:32:35.7784539Z scale_ub: Optional[float], 2025-05-07T20:32:35.7784627Z contiguous: bool, 2025-05-07T20:32:35.7784713Z compiled: bool, 2025-05-07T20:32:35.7784787Z ) -> None: 2025-05-07T20:32:35.7784877Z torch.manual_seed(2025) 2025-05-07T20:32:35.7784948Z 2025-05-07T20:32:35.7785114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7785189Z 2025-05-07T20:32:35.7785280Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7785402Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7785487Z x = x_sign * x_clamp 2025-05-07T20:32:35.7785566Z x0 = x[:, :D] 2025-05-07T20:32:35.7785646Z x1 = x[:, D:] 2025-05-07T20:32:35.7785758Z 2025-05-07T20:32:35.7785850Z if contiguous: 2025-05-07T20:32:35.7785939Z x0 = x0.contiguous() 2025-05-07T20:32:35.7786028Z x1 = x1.contiguous() 2025-05-07T20:32:35.7786096Z 2025-05-07T20:32:35.7786184Z if scale_ub is not None: 2025-05-07T20:32:35.7786291Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7786422Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7786493Z ) 2025-05-07T20:32:35.7786571Z else: 2025-05-07T20:32:35.7786660Z scale_ub_tensor = None 2025-05-07T20:32:35.7786727Z 2025-05-07T20:32:35.7786862Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7786949Z op = silu_mul_quant 2025-05-07T20:32:35.7787031Z if compiled: 2025-05-07T20:32:35.7787129Z op = torch.compile(op) 2025-05-07T20:32:35.7787231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7787304Z 2025-05-07T20:32:35.7787391Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7787396Z 2025-05-07T20:32:35.7787490Z moe/activation_test.py:117: 2025-05-07T20:32:35.7787617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7787714Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7787808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7788307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.7788404Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7788765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7788984Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7789325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7789472Z kernel = self.compile( 2025-05-07T20:32:35.7790058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7790300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7790429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7790433Z 2025-05-07T20:32:35.7790632Z self = 2025-05-07T20:32:35.7791549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7792051Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f1000>} 2025-05-07T20:32:35.7792795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7792982Z context = 2025-05-07T20:32:35.7792986Z 2025-05-07T20:32:35.7793145Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7793406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7793513Z module_map=module_map) 2025-05-07T20:32:35.7793673Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7793769Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7793840Z E ^ 2025-05-07T20:32:35.7794193Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7794263Z 2025-05-07T20:32:35.7794677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7794682Z 2025-05-07T20:32:35.7794783Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7795003Z self=, 2025-05-07T20:32:35.7795077Z T=4096, 2025-05-07T20:32:35.7795150Z D=5120, 2025-05-07T20:32:35.7795230Z scale_ub=1200.0, 2025-05-07T20:32:35.7795315Z contiguous=False, 2025-05-07T20:32:35.7795399Z compiled=True, 2025-05-07T20:32:35.7795465Z ) 2025-05-07T20:32:35.7795680Z self = 2025-05-07T20:32:35.7795855Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.7795860Z 2025-05-07T20:32:35.7795932Z @given( 2025-05-07T20:32:35.7796044Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7796149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7796260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7796374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7796483Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7796552Z ) 2025-05-07T20:32:35.7796793Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7796886Z def test_silu_mul_quant( 2025-05-07T20:32:35.7796958Z self, 2025-05-07T20:32:35.7797033Z T: int, 2025-05-07T20:32:35.7797105Z D: int, 2025-05-07T20:32:35.7797200Z scale_ub: Optional[float], 2025-05-07T20:32:35.7797291Z contiguous: bool, 2025-05-07T20:32:35.7797370Z compiled: bool, 2025-05-07T20:32:35.7797446Z ) -> None: 2025-05-07T20:32:35.7797539Z torch.manual_seed(2025) 2025-05-07T20:32:35.7797606Z 2025-05-07T20:32:35.7797773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7797912Z 2025-05-07T20:32:35.7798000Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7798127Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7798212Z x = x_sign * x_clamp 2025-05-07T20:32:35.7798289Z x0 = x[:, :D] 2025-05-07T20:32:35.7798365Z x1 = x[:, D:] 2025-05-07T20:32:35.7798430Z 2025-05-07T20:32:35.7798509Z if contiguous: 2025-05-07T20:32:35.7798601Z x0 = x0.contiguous() 2025-05-07T20:32:35.7798687Z x1 = x1.contiguous() 2025-05-07T20:32:35.7798796Z 2025-05-07T20:32:35.7798890Z if scale_ub is not None: 2025-05-07T20:32:35.7799033Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7799168Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7799243Z ) 2025-05-07T20:32:35.7799318Z else: 2025-05-07T20:32:35.7799412Z scale_ub_tensor = None 2025-05-07T20:32:35.7799482Z 2025-05-07T20:32:35.7799612Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7799702Z op = silu_mul_quant 2025-05-07T20:32:35.7799786Z if compiled: 2025-05-07T20:32:35.7799881Z op = torch.compile(op) 2025-05-07T20:32:35.7799989Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7800057Z 2025-05-07T20:32:35.7800145Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7800149Z 2025-05-07T20:32:35.7800248Z moe/activation_test.py:117: 2025-05-07T20:32:35.7800371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7800476Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7800575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7800933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7801025Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7801558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7801654Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7802007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7802222Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7802563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7802655Z kernel = self.compile( 2025-05-07T20:32:35.7803030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7803206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7803326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7803333Z 2025-05-07T20:32:35.7803538Z self = 2025-05-07T20:32:35.7804307Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7804796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f0700>} 2025-05-07T20:32:35.7805535Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7805721Z context = 2025-05-07T20:32:35.7805726Z 2025-05-07T20:32:35.7805893Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7806202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7806306Z module_map=module_map) 2025-05-07T20:32:35.7806475Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7806569Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7806640Z E ^ 2025-05-07T20:32:35.7806990Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7807035Z 2025-05-07T20:32:35.7807509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7807514Z 2025-05-07T20:32:35.7807617Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7807832Z self=, 2025-05-07T20:32:35.7807902Z T=2048, 2025-05-07T20:32:35.7807977Z D=7168, 2025-05-07T20:32:35.7808057Z scale_ub=1200.0, 2025-05-07T20:32:35.7808140Z contiguous=False, 2025-05-07T20:32:35.7808223Z compiled=False, 2025-05-07T20:32:35.7808289Z ) 2025-05-07T20:32:35.7808502Z self = 2025-05-07T20:32:35.7808673Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.7808677Z 2025-05-07T20:32:35.7808748Z @given( 2025-05-07T20:32:35.7808868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7808966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7809076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7809193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7809305Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7809378Z ) 2025-05-07T20:32:35.7809622Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7809760Z def test_silu_mul_quant( 2025-05-07T20:32:35.7809838Z self, 2025-05-07T20:32:35.7809910Z T: int, 2025-05-07T20:32:35.7809983Z D: int, 2025-05-07T20:32:35.7810082Z scale_ub: Optional[float], 2025-05-07T20:32:35.7810167Z contiguous: bool, 2025-05-07T20:32:35.7810249Z compiled: bool, 2025-05-07T20:32:35.7810326Z ) -> None: 2025-05-07T20:32:35.7810418Z torch.manual_seed(2025) 2025-05-07T20:32:35.7810486Z 2025-05-07T20:32:35.7810654Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7810727Z 2025-05-07T20:32:35.7810814Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7810950Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7811051Z x = x_sign * x_clamp 2025-05-07T20:32:35.7811140Z x0 = x[:, :D] 2025-05-07T20:32:35.7811229Z x1 = x[:, D:] 2025-05-07T20:32:35.7811296Z 2025-05-07T20:32:35.7811377Z if contiguous: 2025-05-07T20:32:35.7811471Z x0 = x0.contiguous() 2025-05-07T20:32:35.7811559Z x1 = x1.contiguous() 2025-05-07T20:32:35.7811633Z 2025-05-07T20:32:35.7811719Z if scale_ub is not None: 2025-05-07T20:32:35.7811822Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7811955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7812027Z ) 2025-05-07T20:32:35.7812098Z else: 2025-05-07T20:32:35.7812194Z scale_ub_tensor = None 2025-05-07T20:32:35.7812261Z 2025-05-07T20:32:35.7812392Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7812479Z op = silu_mul_quant 2025-05-07T20:32:35.7812564Z if compiled: 2025-05-07T20:32:35.7812666Z op = torch.compile(op) 2025-05-07T20:32:35.7812771Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7812838Z 2025-05-07T20:32:35.7812930Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7812979Z 2025-05-07T20:32:35.7813076Z moe/activation_test.py:117: 2025-05-07T20:32:35.7813201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7813301Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7813397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7813888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.7813980Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7814332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7814636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7814980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7815071Z kernel = self.compile( 2025-05-07T20:32:35.7815463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7815635Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7815763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7815768Z 2025-05-07T20:32:35.7815966Z self = 2025-05-07T20:32:35.7816728Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7817238Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f1240>} 2025-05-07T20:32:35.7818013Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7818206Z context = 2025-05-07T20:32:35.7818211Z 2025-05-07T20:32:35.7818369Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7818628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7818730Z module_map=module_map) 2025-05-07T20:32:35.7818895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7818992Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7819069Z E ^ 2025-05-07T20:32:35.7819417Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7819421Z 2025-05-07T20:32:35.7819968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7819975Z 2025-05-07T20:32:35.7820076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7820294Z self=, 2025-05-07T20:32:35.7820370Z T=1, 2025-05-07T20:32:35.7820445Z D=7168, 2025-05-07T20:32:35.7820524Z scale_ub=None, 2025-05-07T20:32:35.7820605Z contiguous=True, 2025-05-07T20:32:35.7820686Z compiled=False, 2025-05-07T20:32:35.7820757Z ) 2025-05-07T20:32:35.7820966Z self = 2025-05-07T20:32:35.7821134Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.7821142Z 2025-05-07T20:32:35.7821215Z @given( 2025-05-07T20:32:35.7821330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7821428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7821540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7821705Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7821820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7821889Z ) 2025-05-07T20:32:35.7822128Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7822222Z def test_silu_mul_quant( 2025-05-07T20:32:35.7822293Z self, 2025-05-07T20:32:35.7822365Z T: int, 2025-05-07T20:32:35.7822444Z D: int, 2025-05-07T20:32:35.7822541Z scale_ub: Optional[float], 2025-05-07T20:32:35.7822672Z contiguous: bool, 2025-05-07T20:32:35.7822754Z compiled: bool, 2025-05-07T20:32:35.7822828Z ) -> None: 2025-05-07T20:32:35.7822963Z torch.manual_seed(2025) 2025-05-07T20:32:35.7823032Z 2025-05-07T20:32:35.7823197Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7823269Z 2025-05-07T20:32:35.7823356Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7823482Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7823572Z x = x_sign * x_clamp 2025-05-07T20:32:35.7823644Z x0 = x[:, :D] 2025-05-07T20:32:35.7823720Z x1 = x[:, D:] 2025-05-07T20:32:35.7823794Z 2025-05-07T20:32:35.7823873Z if contiguous: 2025-05-07T20:32:35.7823963Z x0 = x0.contiguous() 2025-05-07T20:32:35.7824054Z x1 = x1.contiguous() 2025-05-07T20:32:35.7824120Z 2025-05-07T20:32:35.7824211Z if scale_ub is not None: 2025-05-07T20:32:35.7824318Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7824447Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7824523Z ) 2025-05-07T20:32:35.7824596Z else: 2025-05-07T20:32:35.7824687Z scale_ub_tensor = None 2025-05-07T20:32:35.7824760Z 2025-05-07T20:32:35.7824885Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7824973Z op = silu_mul_quant 2025-05-07T20:32:35.7825103Z if compiled: 2025-05-07T20:32:35.7825204Z op = torch.compile(op) 2025-05-07T20:32:35.7825311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7825382Z 2025-05-07T20:32:35.7825470Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7825475Z 2025-05-07T20:32:35.7825570Z moe/activation_test.py:117: 2025-05-07T20:32:35.7825694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7825791Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7825891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7826382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7826477Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7826832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7827057Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7827400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7827494Z kernel = self.compile( 2025-05-07T20:32:35.7827872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7828044Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7828166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7828173Z 2025-05-07T20:32:35.7828381Z self = 2025-05-07T20:32:35.7829144Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7832485Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f2050>} 2025-05-07T20:32:35.7833260Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7833455Z context = 2025-05-07T20:32:35.7833528Z 2025-05-07T20:32:35.7833694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7833999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7834108Z module_map=module_map) 2025-05-07T20:32:35.7834266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7834368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7834452Z E ^ 2025-05-07T20:32:35.7834803Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7834808Z 2025-05-07T20:32:35.7835222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7835227Z 2025-05-07T20:32:35.7835326Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7835545Z self=, 2025-05-07T20:32:35.7835627Z T=16384, 2025-05-07T20:32:35.7835699Z D=7168, 2025-05-07T20:32:35.7835782Z scale_ub=1200.0, 2025-05-07T20:32:35.7835865Z contiguous=False, 2025-05-07T20:32:35.7835944Z compiled=True, 2025-05-07T20:32:35.7836014Z ) 2025-05-07T20:32:35.7836233Z self = 2025-05-07T20:32:35.7836450Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.7836455Z 2025-05-07T20:32:35.7836530Z @given( 2025-05-07T20:32:35.7836649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7836746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7836861Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7836974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7837085Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7837162Z ) 2025-05-07T20:32:35.7837412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7837508Z def test_silu_mul_quant( 2025-05-07T20:32:35.7837580Z self, 2025-05-07T20:32:35.7837650Z T: int, 2025-05-07T20:32:35.7837724Z D: int, 2025-05-07T20:32:35.7837820Z scale_ub: Optional[float], 2025-05-07T20:32:35.7837906Z contiguous: bool, 2025-05-07T20:32:35.7837993Z compiled: bool, 2025-05-07T20:32:35.7838074Z ) -> None: 2025-05-07T20:32:35.7838168Z torch.manual_seed(2025) 2025-05-07T20:32:35.7838241Z 2025-05-07T20:32:35.7838407Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7838475Z 2025-05-07T20:32:35.7838566Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7838688Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7838776Z x = x_sign * x_clamp 2025-05-07T20:32:35.7838854Z x0 = x[:, :D] 2025-05-07T20:32:35.7838933Z x1 = x[:, D:] 2025-05-07T20:32:35.7839007Z 2025-05-07T20:32:35.7839087Z if contiguous: 2025-05-07T20:32:35.7839178Z x0 = x0.contiguous() 2025-05-07T20:32:35.7839266Z x1 = x1.contiguous() 2025-05-07T20:32:35.7839332Z 2025-05-07T20:32:35.7839418Z if scale_ub is not None: 2025-05-07T20:32:35.7839527Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7839665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7839782Z ) 2025-05-07T20:32:35.7839856Z else: 2025-05-07T20:32:35.7839947Z scale_ub_tensor = None 2025-05-07T20:32:35.7840013Z 2025-05-07T20:32:35.7840145Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7840232Z op = silu_mul_quant 2025-05-07T20:32:35.7840317Z if compiled: 2025-05-07T20:32:35.7840415Z op = torch.compile(op) 2025-05-07T20:32:35.7840517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7840655Z 2025-05-07T20:32:35.7840745Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7840750Z 2025-05-07T20:32:35.7840972Z moe/activation_test.py:117: 2025-05-07T20:32:35.7841122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7841243Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7841338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7841716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7841810Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7842299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7842395Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7842748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7842973Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7843311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7843406Z kernel = self.compile( 2025-05-07T20:32:35.7843782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7843999Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7844128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7844133Z 2025-05-07T20:32:35.7844337Z self = 2025-05-07T20:32:35.7845101Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7845601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f3490>} 2025-05-07T20:32:35.7846334Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7846532Z context = 2025-05-07T20:32:35.7846537Z 2025-05-07T20:32:35.7846702Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7846963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7847067Z module_map=module_map) 2025-05-07T20:32:35.7847228Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7847326Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7847401Z E ^ 2025-05-07T20:32:35.7847750Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7847754Z 2025-05-07T20:32:35.7848170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7848175Z 2025-05-07T20:32:35.7848318Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7848540Z self=, 2025-05-07T20:32:35.7848613Z T=1, 2025-05-07T20:32:35.7848683Z D=7168, 2025-05-07T20:32:35.7848764Z scale_ub=None, 2025-05-07T20:32:35.7848846Z contiguous=False, 2025-05-07T20:32:35.7848927Z compiled=False, 2025-05-07T20:32:35.7848998Z ) 2025-05-07T20:32:35.7849212Z self = 2025-05-07T20:32:35.7849377Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.7849424Z 2025-05-07T20:32:35.7849498Z @given( 2025-05-07T20:32:35.7849652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7849753Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7849869Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7849985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7850108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7850180Z ) 2025-05-07T20:32:35.7850420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7850513Z def test_silu_mul_quant( 2025-05-07T20:32:35.7850584Z self, 2025-05-07T20:32:35.7850659Z T: int, 2025-05-07T20:32:35.7850733Z D: int, 2025-05-07T20:32:35.7850832Z scale_ub: Optional[float], 2025-05-07T20:32:35.7850924Z contiguous: bool, 2025-05-07T20:32:35.7851005Z compiled: bool, 2025-05-07T20:32:35.7851084Z ) -> None: 2025-05-07T20:32:35.7851179Z torch.manual_seed(2025) 2025-05-07T20:32:35.7851247Z 2025-05-07T20:32:35.7851416Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7851490Z 2025-05-07T20:32:35.7851579Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7851702Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7851788Z x = x_sign * x_clamp 2025-05-07T20:32:35.7851910Z x0 = x[:, :D] 2025-05-07T20:32:35.7851989Z x1 = x[:, D:] 2025-05-07T20:32:35.7852061Z 2025-05-07T20:32:35.7852142Z if contiguous: 2025-05-07T20:32:35.7852235Z x0 = x0.contiguous() 2025-05-07T20:32:35.7852323Z x1 = x1.contiguous() 2025-05-07T20:32:35.7852389Z 2025-05-07T20:32:35.7852479Z if scale_ub is not None: 2025-05-07T20:32:35.7852582Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7852715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7852796Z ) 2025-05-07T20:32:35.7852866Z else: 2025-05-07T20:32:35.7852957Z scale_ub_tensor = None 2025-05-07T20:32:35.7853031Z 2025-05-07T20:32:35.7853168Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7853253Z op = silu_mul_quant 2025-05-07T20:32:35.7853335Z if compiled: 2025-05-07T20:32:35.7853436Z op = torch.compile(op) 2025-05-07T20:32:35.7853547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7853613Z 2025-05-07T20:32:35.7853707Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7853711Z 2025-05-07T20:32:35.7853805Z moe/activation_test.py:117: 2025-05-07T20:32:35.7853933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7854030Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7854124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7854615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7854712Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7855063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7855281Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7855670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7855765Z kernel = self.compile( 2025-05-07T20:32:35.7856139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7856311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7856436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7856440Z 2025-05-07T20:32:35.7856686Z self = 2025-05-07T20:32:35.7857493Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7857996Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f37f0>} 2025-05-07T20:32:35.7858734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7858925Z context = 2025-05-07T20:32:35.7858929Z 2025-05-07T20:32:35.7859094Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7859362Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7859468Z module_map=module_map) 2025-05-07T20:32:35.7859629Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7859728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7859934Z E ^ 2025-05-07T20:32:35.7860323Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7860335Z 2025-05-07T20:32:35.7860749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7860754Z 2025-05-07T20:32:35.7860854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7861073Z self=, 2025-05-07T20:32:35.7861145Z T=2048, 2025-05-07T20:32:35.7861215Z D=7168, 2025-05-07T20:32:35.7861300Z scale_ub=None, 2025-05-07T20:32:35.7861382Z contiguous=False, 2025-05-07T20:32:35.7861461Z compiled=True, 2025-05-07T20:32:35.7861538Z ) 2025-05-07T20:32:35.7861748Z self = 2025-05-07T20:32:35.7861922Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7861926Z 2025-05-07T20:32:35.7861999Z @given( 2025-05-07T20:32:35.7862115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7862216Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7862328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7862444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7862555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7862624Z ) 2025-05-07T20:32:35.7862874Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7862964Z def test_silu_mul_quant( 2025-05-07T20:32:35.7863038Z self, 2025-05-07T20:32:35.7863112Z T: int, 2025-05-07T20:32:35.7863185Z D: int, 2025-05-07T20:32:35.7863282Z scale_ub: Optional[float], 2025-05-07T20:32:35.7863371Z contiguous: bool, 2025-05-07T20:32:35.7863453Z compiled: bool, 2025-05-07T20:32:35.7863524Z ) -> None: 2025-05-07T20:32:35.7863618Z torch.manual_seed(2025) 2025-05-07T20:32:35.7863733Z 2025-05-07T20:32:35.7863902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7863974Z 2025-05-07T20:32:35.7864064Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7864184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7864273Z x = x_sign * x_clamp 2025-05-07T20:32:35.7864350Z x0 = x[:, :D] 2025-05-07T20:32:35.7864428Z x1 = x[:, D:] 2025-05-07T20:32:35.7864496Z 2025-05-07T20:32:35.7864576Z if contiguous: 2025-05-07T20:32:35.7864711Z x0 = x0.contiguous() 2025-05-07T20:32:35.7864795Z x1 = x1.contiguous() 2025-05-07T20:32:35.7864862Z 2025-05-07T20:32:35.7864993Z if scale_ub is not None: 2025-05-07T20:32:35.7865096Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7865227Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7865303Z ) 2025-05-07T20:32:35.7865375Z else: 2025-05-07T20:32:35.7865472Z scale_ub_tensor = None 2025-05-07T20:32:35.7865544Z 2025-05-07T20:32:35.7865672Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7865760Z op = silu_mul_quant 2025-05-07T20:32:35.7865843Z if compiled: 2025-05-07T20:32:35.7865941Z op = torch.compile(op) 2025-05-07T20:32:35.7866045Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7866114Z 2025-05-07T20:32:35.7866202Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7866206Z 2025-05-07T20:32:35.7866306Z moe/activation_test.py:117: 2025-05-07T20:32:35.7866431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7866531Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7866629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7866989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7867086Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7867621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7867716Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7868076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7868292Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7868625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7868722Z kernel = self.compile( 2025-05-07T20:32:35.7869099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7869273Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7869398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7869405Z 2025-05-07T20:32:35.7869608Z self = 2025-05-07T20:32:35.7870374Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7870876Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1b50af0>} 2025-05-07T20:32:35.7871675Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7871863Z context = 2025-05-07T20:32:35.7871934Z 2025-05-07T20:32:35.7872104Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7872359Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7872463Z module_map=module_map) 2025-05-07T20:32:35.7872625Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7872720Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7872791Z E ^ 2025-05-07T20:32:35.7873140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7873185Z 2025-05-07T20:32:35.7873637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7873642Z 2025-05-07T20:32:35.7873749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7873967Z self=, 2025-05-07T20:32:35.7874042Z T=4096, 2025-05-07T20:32:35.7874116Z D=7168, 2025-05-07T20:32:35.7874191Z scale_ub=None, 2025-05-07T20:32:35.7874274Z contiguous=False, 2025-05-07T20:32:35.7874355Z compiled=True, 2025-05-07T20:32:35.7874421Z ) 2025-05-07T20:32:35.7874630Z self = 2025-05-07T20:32:35.7874804Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7874808Z 2025-05-07T20:32:35.7874881Z @given( 2025-05-07T20:32:35.7874999Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7875095Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7875209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7875329Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7875437Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7875506Z ) 2025-05-07T20:32:35.7875791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7875887Z def test_silu_mul_quant( 2025-05-07T20:32:35.7875959Z self, 2025-05-07T20:32:35.7876030Z T: int, 2025-05-07T20:32:35.7876103Z D: int, 2025-05-07T20:32:35.7876202Z scale_ub: Optional[float], 2025-05-07T20:32:35.7876287Z contiguous: bool, 2025-05-07T20:32:35.7876368Z compiled: bool, 2025-05-07T20:32:35.7876444Z ) -> None: 2025-05-07T20:32:35.7876535Z torch.manual_seed(2025) 2025-05-07T20:32:35.7876601Z 2025-05-07T20:32:35.7876773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7876842Z 2025-05-07T20:32:35.7876930Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7877054Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7877139Z x = x_sign * x_clamp 2025-05-07T20:32:35.7877216Z x0 = x[:, :D] 2025-05-07T20:32:35.7877296Z x1 = x[:, D:] 2025-05-07T20:32:35.7877365Z 2025-05-07T20:32:35.7877449Z if contiguous: 2025-05-07T20:32:35.7877536Z x0 = x0.contiguous() 2025-05-07T20:32:35.7877621Z x1 = x1.contiguous() 2025-05-07T20:32:35.7877689Z 2025-05-07T20:32:35.7877775Z if scale_ub is not None: 2025-05-07T20:32:35.7877877Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7878010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7878080Z ) 2025-05-07T20:32:35.7878150Z else: 2025-05-07T20:32:35.7878241Z scale_ub_tensor = None 2025-05-07T20:32:35.7878317Z 2025-05-07T20:32:35.7878444Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7878536Z op = silu_mul_quant 2025-05-07T20:32:35.7878619Z if compiled: 2025-05-07T20:32:35.7878719Z op = torch.compile(op) 2025-05-07T20:32:35.7878822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7878889Z 2025-05-07T20:32:35.7879029Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7879033Z 2025-05-07T20:32:35.7879129Z moe/activation_test.py:117: 2025-05-07T20:32:35.7879254Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7879355Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7879451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7879818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7879910Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7880484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.7880582Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.7880938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.7881161Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.7881505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.7881595Z     kernel = self.compile(
2025-05-07T20:32:35.7881978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.7882149Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.7882269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7882277Z 
2025-05-07T20:32:35.7882485Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:35.7883248Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.7883786Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fcec1b50280>}
2025-05-07T20:32:35.7884534Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:35.7884722Z context = <...>
2025-05-07T20:32:35.7884726Z 
2025-05-07T20:32:35.7884894Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.7885152Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.7885257Z                            module_map=module_map)
2025-05-07T20:32:35.7885417Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.7885509Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.7885588Z E       ^
2025-05-07T20:32:35.7885936Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.7885940Z 
2025-05-07T20:32:35.7886354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7886358Z 
2025-05-07T20:32:35.7886456Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:35.7886677Z     self=<...>,
2025-05-07T20:32:35.7886757Z     T=16384,
2025-05-07T20:32:35.7886829Z     D=5120,
2025-05-07T20:32:35.7886909Z     scale_ub=1200.0,
2025-05-07T20:32:35.7886994Z     contiguous=False,
2025-05-07T20:32:35.7887074Z     compiled=False,
2025-05-07T20:32:35.7887142Z )
2025-05-07T20:32:35.7887356Z self = <...>
2025-05-07T20:32:35.7887531Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:35.7887583Z 
2025-05-07T20:32:35.7887661Z     @given(
2025-05-07T20:32:35.7887777Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:35.7887872Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:35.7887987Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:35.7888101Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:35.7888214Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:35.7888290Z     )
2025-05-07T20:32:35.7888537Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:35.7888667Z     def test_silu_mul_quant(
2025-05-07T20:32:35.7888740Z         self,
2025-05-07T20:32:35.7888849Z         T: int,
2025-05-07T20:32:35.7888923Z         D: int,
2025-05-07T20:32:35.7889019Z         scale_ub: Optional[float],
2025-05-07T20:32:35.7889105Z         contiguous: bool,
2025-05-07T20:32:35.7889190Z         compiled: bool,
2025-05-07T20:32:35.7889263Z     ) -> None:
2025-05-07T20:32:35.7889361Z         torch.manual_seed(2025)
2025-05-07T20:32:35.7889432Z 
2025-05-07T20:32:35.7889598Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:35.7889664Z 
2025-05-07T20:32:35.7889754Z         x_sign = torch.sign(x)
2025-05-07T20:32:35.7890494Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:35.7890625Z         x = x_sign * x_clamp
2025-05-07T20:32:35.7890718Z         x0 = x[:, :D]
2025-05-07T20:32:35.7890798Z         x1 = x[:, D:]
2025-05-07T20:32:35.7890880Z 
2025-05-07T20:32:35.7890974Z         if contiguous:
2025-05-07T20:32:35.7891064Z             x0 = x0.contiguous()
2025-05-07T20:32:35.7891159Z             x1 = x1.contiguous()
2025-05-07T20:32:35.7891229Z 
2025-05-07T20:32:35.7891319Z         if scale_ub is not None:
2025-05-07T20:32:35.7891430Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:35.7891567Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:35.7891648Z             )
2025-05-07T20:32:35.7891866Z         else:
2025-05-07T20:32:35.7891965Z             scale_ub_tensor = None
2025-05-07T20:32:35.7892039Z 
2025-05-07T20:32:35.7892173Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:35.7892262Z             op = silu_mul_quant
2025-05-07T20:32:35.7892349Z             if compiled:
2025-05-07T20:32:35.7892448Z                 op = torch.compile(op)
2025-05-07T20:32:35.7892555Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.7892629Z 
2025-05-07T20:32:35.7892720Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:35.7892729Z 
2025-05-07T20:32:35.7892826Z moe/activation_test.py:117: 
2025-05-07T20:32:35.7892960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7893059Z moe/activation_test.py:115: in fn
2025-05-07T20:32:35.7893164Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.7893669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.7893767Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.7894130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.7894349Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.7894690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.7894781Z     kernel = self.compile(
2025-05-07T20:32:35.7895169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.7895347Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.7895472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7895476Z 
2025-05-07T20:32:35.7895683Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:35.7896519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.7897017Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fcec1b52d40>}
2025-05-07T20:32:35.7897828Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:35.7898089Z context = <...>
2025-05-07T20:32:35.7898094Z 
2025-05-07T20:32:35.7898261Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.7898526Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.7898631Z                            module_map=module_map)
2025-05-07T20:32:35.7898793Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.7898888Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.7898963Z E       ^
2025-05-07T20:32:35.7899317Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.7899322Z 
2025-05-07T20:32:35.7899740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
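Every sampled combination of (T, D, scale_ub, contiguous, compiled) fails with this same compile-time error, so the failure tracks the GPU architecture rather than the inputs: Triton's fp8e4nv type (torch.float8_e4m3fn, the format this kernel quantizes to) is generally lowered only on NVIDIA GPUs with compute capability 8.9 or newer, while older parts expose just fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability gate such a test could use; the helper name and decorator placement are assumptions, not FBGEMM's actual API:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv (torch.float8_e4m3fn) only on NVIDIA
        # GPUs with compute capability >= 8.9 (Ada/Hopper); older architectures
        # offer just fp8e4b15/fp8e5, matching the ValueError above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical gating of the failing suite:
    @unittest.skipIf(not _supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...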
2025-05-07T20:32:35.7900033Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.7913037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7913142Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.7925898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7926005Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.7938822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7938933Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7953299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7956400Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.7969518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7969624Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7982689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7982794Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.7996023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7996131Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:35.8008767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
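From how the test drives it, silu_mul_quant(x0, x1, scale_ub_tensor) appears to fuse SiLU(x0) * x1 with rowwise fp8 quantization and return the quantized tensor plus per-row scales. A plain-PyTorch reference of those assumed semantics; the function name, scaling rule, and scale_ub clamp below are guesses, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: fused SiLU(x0) * x1, then rowwise fp8 quantization,
        # with scale_ub as an optional upper bound on the per-row maximum used
        # to derive the scale.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Dequantizing with y_fp8.float() * y_scale should then approximate SiLU(x0) * x1, which is presumably what the test asserts after the failing call.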
2025-05-07T20:32:35.8008877Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.8021435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.8021601Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.8034649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.8034762Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.8046777Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.8046876Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.8046954Z E       ^
2025-05-07T20:32:35.8047309Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8047317Z 2025-05-07T20:32:35.8047741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8047745Z 2025-05-07T20:32:35.8047847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8048069Z self=, 2025-05-07T20:32:35.8048144Z T=2048, 2025-05-07T20:32:35.8048218Z D=7168, 2025-05-07T20:32:35.8048300Z scale_ub=None, 2025-05-07T20:32:35.8048383Z contiguous=True, 2025-05-07T20:32:35.8048468Z compiled=True, 2025-05-07T20:32:35.8048541Z ) 2025-05-07T20:32:35.8048760Z self = 2025-05-07T20:32:35.8048931Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.8048938Z 2025-05-07T20:32:35.8049012Z @given( 2025-05-07T20:32:35.8049129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8049271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8049388Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8049507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8049624Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8049696Z ) 2025-05-07T20:32:35.8049939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8050036Z def test_silu_mul_quant( 2025-05-07T20:32:35.8050112Z self, 2025-05-07T20:32:35.8050192Z T: int, 2025-05-07T20:32:35.8050270Z D: int, 2025-05-07T20:32:35.8050367Z scale_ub: Optional[float], 2025-05-07T20:32:35.8050466Z contiguous: bool, 2025-05-07T20:32:35.8050552Z compiled: bool, 2025-05-07T20:32:35.8050629Z ) -> None: 2025-05-07T20:32:35.8050724Z torch.manual_seed(2025) 2025-05-07T20:32:35.8050795Z 2025-05-07T20:32:35.8050960Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8051038Z 2025-05-07T20:32:35.8051129Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8051250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8051340Z x = x_sign * x_clamp 2025-05-07T20:32:35.8051418Z x0 = x[:, :D] 2025-05-07T20:32:35.8051495Z x1 = x[:, D:] 2025-05-07T20:32:35.8051569Z 2025-05-07T20:32:35.8051650Z if contiguous: 2025-05-07T20:32:35.8051744Z x0 = x0.contiguous() 2025-05-07T20:32:35.8051832Z x1 = x1.contiguous() 2025-05-07T20:32:35.8051905Z 2025-05-07T20:32:35.8051996Z if scale_ub is not None: 2025-05-07T20:32:35.8052101Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8052236Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8052316Z ) 2025-05-07T20:32:35.8052391Z else: 2025-05-07T20:32:35.8052486Z scale_ub_tensor = None 2025-05-07T20:32:35.8052561Z 2025-05-07T20:32:35.8052738Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8052829Z op = silu_mul_quant 2025-05-07T20:32:35.8052919Z if compiled: 2025-05-07T20:32:35.8053018Z op = torch.compile(op) 2025-05-07T20:32:35.8053125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8053196Z 2025-05-07T20:32:35.8053287Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8053292Z 2025-05-07T20:32:35.8053394Z moe/activation_test.py:117: 2025-05-07T20:32:35.8053521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8053665Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8053801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8054164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.8054256Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.8054752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8054850Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8055211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8055435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8055779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8055882Z kernel = self.compile( 2025-05-07T20:32:35.8056271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8056448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8056574Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8056579Z 2025-05-07T20:32:35.8056830Z self = 2025-05-07T20:32:35.8057615Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8058120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec13a6560>} 2025-05-07T20:32:35.8058870Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8059059Z context = 2025-05-07T20:32:35.8059063Z 2025-05-07T20:32:35.8059226Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8059498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8059604Z module_map=module_map) 2025-05-07T20:32:35.8059833Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8059956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8060031Z E ^ 2025-05-07T20:32:35.8060387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8060391Z 2025-05-07T20:32:35.8060800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8060807Z 2025-05-07T20:32:35.8060915Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8061161Z self=, 2025-05-07T20:32:35.8061251Z T=16384, 2025-05-07T20:32:35.8061332Z D=5120, 2025-05-07T20:32:35.8061464Z scale_ub=None, 2025-05-07T20:32:35.8061551Z contiguous=False, 2025-05-07T20:32:35.8061638Z compiled=False, 2025-05-07T20:32:35.8061706Z ) 2025-05-07T20:32:35.8061916Z self = 2025-05-07T20:32:35.8062095Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.8062100Z 2025-05-07T20:32:35.8062170Z @given( 2025-05-07T20:32:35.8062290Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8062387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8062542Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8062698Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8062811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8062882Z ) 2025-05-07T20:32:35.8063129Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8063222Z def test_silu_mul_quant( 2025-05-07T20:32:35.8063296Z self, 2025-05-07T20:32:35.8063370Z T: int, 2025-05-07T20:32:35.8063443Z D: int, 2025-05-07T20:32:35.8063539Z scale_ub: Optional[float], 2025-05-07T20:32:35.8063629Z contiguous: bool, 2025-05-07T20:32:35.8063709Z compiled: bool, 2025-05-07T20:32:35.8063785Z ) -> None: 2025-05-07T20:32:35.8063879Z torch.manual_seed(2025) 2025-05-07T20:32:35.8063948Z 2025-05-07T20:32:35.8064118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8064191Z 2025-05-07T20:32:35.8064280Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8064407Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8066239Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
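[annotation] Two distinct failures repeat through this run: the Triton CompilationError seen in the blocks above, and, beginning here, CUDA OOMs. The CompilationError is raised while lowering `_fbgemm_silu_mul_quant`: the kernel produces an fp8e4nv (E4M3) output, which Triton's NVIDIA backend only enables on compute capability 8.9 and newer. The g5.4xlarge runner's A10G reports capability 8.6, where only fp8e4b15 and fp8e5 exist, exactly as the error lists. A minimal guard, sketched here rather than taken from the repo (function and class names are illustrative), would skip the fp8 path on such GPUs:

# Sketch only, not from moe/activation_test.py: skip fp8 tests where the
# Triton backend lacks fp8e4nv (the type torch.float8_e4m3fn lowers to).
import unittest
import torch

def _has_fp8e4nv() -> bool:
    # fp8e4nv needs compute capability >= (8, 9) (Ada/Hopper);
    # the A10G on this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not _has_fp8e4nv(), "fp8e4nv unsupported on this architecture")
class SiluMulQuantFP8Test(unittest.TestCase):  # illustrative name
    ...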
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8066249Z 2025-05-07T20:32:35.8066369Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.8066374Z 2025-05-07T20:32:35.8066471Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8066690Z self=, 2025-05-07T20:32:35.8066765Z T=4096, 2025-05-07T20:32:35.8066838Z D=7168, 2025-05-07T20:32:35.8066922Z scale_ub=1200.0, 2025-05-07T20:32:35.8067002Z contiguous=True, 2025-05-07T20:32:35.8067081Z compiled=True, 2025-05-07T20:32:35.8067155Z ) 2025-05-07T20:32:35.8067364Z self = 2025-05-07T20:32:35.8067536Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.8067541Z 2025-05-07T20:32:35.8067611Z @given( 2025-05-07T20:32:35.8067723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8067819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8067936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8068050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8068163Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8068236Z ) 2025-05-07T20:32:35.8068482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8068577Z def test_silu_mul_quant( 2025-05-07T20:32:35.8068652Z self, 2025-05-07T20:32:35.8068724Z T: int, 2025-05-07T20:32:35.8068797Z D: int, 2025-05-07T20:32:35.8068892Z scale_ub: Optional[float], 2025-05-07T20:32:35.8068979Z contiguous: bool, 2025-05-07T20:32:35.8069110Z compiled: bool, 2025-05-07T20:32:35.8069185Z ) -> None: 2025-05-07T20:32:35.8069278Z torch.manual_seed(2025) 2025-05-07T20:32:35.8069350Z 2025-05-07T20:32:35.8069513Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8069590Z 2025-05-07T20:32:35.8069681Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8069803Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8071620Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8071668Z 2025-05-07T20:32:35.8071786Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.8071791Z 2025-05-07T20:32:35.8071892Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8072109Z self=, 2025-05-07T20:32:35.8072182Z T=16384, 2025-05-07T20:32:35.8072258Z D=7168, 2025-05-07T20:32:35.8072336Z scale_ub=None, 2025-05-07T20:32:35.8072420Z contiguous=False, 2025-05-07T20:32:35.8072508Z compiled=False, 2025-05-07T20:32:35.8072587Z ) 2025-05-07T20:32:35.8072801Z self = 2025-05-07T20:32:35.8072976Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.8072981Z 2025-05-07T20:32:35.8073052Z @given( 2025-05-07T20:32:35.8073170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8073264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8073419Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8073536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8073647Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8073715Z ) 2025-05-07T20:32:35.8073965Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8074055Z def test_silu_mul_quant( 2025-05-07T20:32:35.8074135Z self, 2025-05-07T20:32:35.8074206Z T: int, 2025-05-07T20:32:35.8074282Z D: int, 2025-05-07T20:32:35.8074379Z scale_ub: Optional[float], 2025-05-07T20:32:35.8074466Z contiguous: bool, 2025-05-07T20:32:35.8074551Z compiled: bool, 2025-05-07T20:32:35.8077444Z ) -> None: 2025-05-07T20:32:35.8077558Z torch.manual_seed(2025) 2025-05-07T20:32:35.8077634Z 2025-05-07T20:32:35.8077805Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8079621Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8079634Z 2025-05-07T20:32:35.8079754Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8079759Z 2025-05-07T20:32:35.8079865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8080090Z self=, 2025-05-07T20:32:35.8080166Z T=2048, 2025-05-07T20:32:35.8080247Z D=7168, 2025-05-07T20:32:35.8080329Z scale_ub=1200.0, 2025-05-07T20:32:35.8080502Z contiguous=True, 2025-05-07T20:32:35.8080588Z compiled=True, 2025-05-07T20:32:35.8080660Z ) 2025-05-07T20:32:35.8080873Z self = 2025-05-07T20:32:35.8081043Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.8081048Z 2025-05-07T20:32:35.8081123Z @given( 2025-05-07T20:32:35.8081239Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8081342Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8081499Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8081619Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8081771Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8081846Z ) 2025-05-07T20:32:35.8082097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8082189Z def test_silu_mul_quant( 2025-05-07T20:32:35.8082268Z self, 2025-05-07T20:32:35.8082352Z T: int, 2025-05-07T20:32:35.8082429Z D: int, 2025-05-07T20:32:35.8082526Z scale_ub: Optional[float], 2025-05-07T20:32:35.8082617Z contiguous: bool, 2025-05-07T20:32:35.8082702Z compiled: bool, 2025-05-07T20:32:35.8082781Z ) -> None: 2025-05-07T20:32:35.8082878Z torch.manual_seed(2025) 2025-05-07T20:32:35.8082949Z 2025-05-07T20:32:35.8083114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8083188Z 2025-05-07T20:32:35.8083285Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8083412Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8085217Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8085228Z 2025-05-07T20:32:35.8085351Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.8085356Z 2025-05-07T20:32:35.8085457Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8085676Z self=, 2025-05-07T20:32:35.8085760Z T=2048, 2025-05-07T20:32:35.8085835Z D=7168, 2025-05-07T20:32:35.8085917Z scale_ub=None, 2025-05-07T20:32:35.8086012Z contiguous=True, 2025-05-07T20:32:35.8086096Z compiled=False, 2025-05-07T20:32:35.8086167Z ) 2025-05-07T20:32:35.8086383Z self = 2025-05-07T20:32:35.8086553Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8086560Z 2025-05-07T20:32:35.8086636Z @given( 2025-05-07T20:32:35.8086751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8086858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8086974Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8087094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8087207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8087286Z ) 2025-05-07T20:32:35.8087527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8087630Z def test_silu_mul_quant( 2025-05-07T20:32:35.8087708Z self, 2025-05-07T20:32:35.8087785Z T: int, 2025-05-07T20:32:35.8087862Z D: int, 2025-05-07T20:32:35.8087963Z scale_ub: Optional[float], 2025-05-07T20:32:35.8088050Z contiguous: bool, 2025-05-07T20:32:35.8088140Z compiled: bool, 2025-05-07T20:32:35.8088264Z ) -> None: 2025-05-07T20:32:35.8088362Z torch.manual_seed(2025) 2025-05-07T20:32:35.8088441Z 2025-05-07T20:32:35.8088606Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8088684Z 2025-05-07T20:32:35.8088782Z > x_sign = torch.sign(x) 2025-05-07T20:32:35.8090882Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
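[annotation] Note the allocator statistics across these OOM examples: each failed request is small (40 to 448 MiB), but the process is already holding roughly 21.6 to 21.7 GiB of the A10G's 22.07 GiB, so once the first large example fails, later examples fail at whichever statement first needs a fresh block (activation_test.py:92, :94, or :95). The error text itself suggests expandable segments; that knob must be in the environment before CUDA initializes, roughly as below (a sketch of the documented setting, not the workflow's actual configuration):

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA context is created, so
# export it in the job environment or set it before importing torch:
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
import torch  # import only after the allocator config is in place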
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8090945Z 2025-05-07T20:32:35.8091093Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:35.8091105Z 2025-05-07T20:32:35.8091217Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8091464Z self=, 2025-05-07T20:32:35.8091541Z T=1, 2025-05-07T20:32:35.8091615Z D=7168, 2025-05-07T20:32:35.8091700Z scale_ub=1200.0, 2025-05-07T20:32:35.8091793Z contiguous=True, 2025-05-07T20:32:35.8091877Z compiled=False, 2025-05-07T20:32:35.8091953Z ) 2025-05-07T20:32:35.8092169Z self = 2025-05-07T20:32:35.8092335Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8092340Z 2025-05-07T20:32:35.8092423Z @given( 2025-05-07T20:32:35.8092539Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8092640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8092754Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8092930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8093056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8093129Z ) 2025-05-07T20:32:35.8093371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8093470Z def test_silu_mul_quant( 2025-05-07T20:32:35.8093545Z self, 2025-05-07T20:32:35.8093621Z T: int, 2025-05-07T20:32:35.8093698Z D: int, 2025-05-07T20:32:35.8093795Z scale_ub: Optional[float], 2025-05-07T20:32:35.8093885Z contiguous: bool, 2025-05-07T20:32:35.8093984Z compiled: bool, 2025-05-07T20:32:35.8094063Z ) -> None: 2025-05-07T20:32:35.8094163Z torch.manual_seed(2025) 2025-05-07T20:32:35.8094237Z 2025-05-07T20:32:35.8094407Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8094483Z 2025-05-07T20:32:35.8094575Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8094699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8094803Z x = x_sign * x_clamp 2025-05-07T20:32:35.8094889Z x0 = x[:, :D] 2025-05-07T20:32:35.8094968Z x1 = x[:, D:] 2025-05-07T20:32:35.8095042Z 2025-05-07T20:32:35.8095127Z if contiguous: 2025-05-07T20:32:35.8095217Z x0 = x0.contiguous() 2025-05-07T20:32:35.8095307Z x1 = x1.contiguous() 2025-05-07T20:32:35.8095377Z 2025-05-07T20:32:35.8095468Z if scale_ub is not None: 2025-05-07T20:32:35.8095581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8095720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8095802Z ) 2025-05-07T20:32:35.8095880Z else: 2025-05-07T20:32:35.8095975Z scale_ub_tensor = None 2025-05-07T20:32:35.8096055Z 2025-05-07T20:32:35.8096184Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8096271Z op = silu_mul_quant 2025-05-07T20:32:35.8096426Z if compiled: 2025-05-07T20:32:35.8096531Z op = torch.compile(op) 2025-05-07T20:32:35.8096637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8096710Z 2025-05-07T20:32:35.8096799Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8096803Z 2025-05-07T20:32:35.8096902Z moe/activation_test.py:117: 2025-05-07T20:32:35.8097030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8097128Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8097228Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8097819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8097919Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8098282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8098510Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8098866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8098962Z kernel = self.compile( 2025-05-07T20:32:35.8099340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8099513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8099642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8099650Z 2025-05-07T20:32:35.8099936Z self = 2025-05-07T20:32:35.8100711Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8101265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec0f644c0>} 2025-05-07T20:32:35.8102018Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8102209Z context = 2025-05-07T20:32:35.8102214Z 2025-05-07T20:32:35.8102379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8102652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8102755Z module_map=module_map) 2025-05-07T20:32:35.8102914Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8103013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8103090Z E ^ 2025-05-07T20:32:35.8103440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8103444Z 2025-05-07T20:32:35.8103858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8103863Z 2025-05-07T20:32:35.8103962Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8104180Z self=, 2025-05-07T20:32:35.8104259Z T=128, 2025-05-07T20:32:35.8104333Z D=5120, 2025-05-07T20:32:35.8104419Z scale_ub=None, 2025-05-07T20:32:35.8104500Z contiguous=True, 2025-05-07T20:32:35.8104580Z compiled=False, 2025-05-07T20:32:35.8104653Z ) 2025-05-07T20:32:35.8104867Z self = 2025-05-07T20:32:35.8105032Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8105085Z 2025-05-07T20:32:35.8105159Z @given( 2025-05-07T20:32:35.8105274Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8105373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8105485Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8105597Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8105715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8105785Z ) 2025-05-07T20:32:35.8106024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8106162Z def test_silu_mul_quant( 2025-05-07T20:32:35.8106238Z self, 2025-05-07T20:32:35.8106349Z T: int, 2025-05-07T20:32:35.8106424Z D: int, 2025-05-07T20:32:35.8106520Z scale_ub: Optional[float], 2025-05-07T20:32:35.8106607Z contiguous: bool, 2025-05-07T20:32:35.8106690Z compiled: bool, 2025-05-07T20:32:35.8106763Z ) -> None: 2025-05-07T20:32:35.8106865Z torch.manual_seed(2025) 2025-05-07T20:32:35.8106936Z 2025-05-07T20:32:35.8107099Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8107172Z 2025-05-07T20:32:35.8107260Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8107380Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8107471Z x = x_sign * x_clamp 2025-05-07T20:32:35.8107547Z x0 = x[:, :D] 2025-05-07T20:32:35.8107623Z x1 = x[:, D:] 2025-05-07T20:32:35.8107696Z 2025-05-07T20:32:35.8107781Z if contiguous: 2025-05-07T20:32:35.8107874Z x0 = x0.contiguous() 2025-05-07T20:32:35.8107963Z x1 = x1.contiguous() 2025-05-07T20:32:35.8108033Z 2025-05-07T20:32:35.8108124Z if scale_ub is not None: 2025-05-07T20:32:35.8108228Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8108358Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8108434Z ) 2025-05-07T20:32:35.8108577Z else: 2025-05-07T20:32:35.8108669Z scale_ub_tensor = None 2025-05-07T20:32:35.8108744Z 2025-05-07T20:32:35.8108870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8108957Z op = silu_mul_quant 2025-05-07T20:32:35.8109045Z if compiled: 2025-05-07T20:32:35.8109148Z op = torch.compile(op) 2025-05-07T20:32:35.8109254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8109320Z 2025-05-07T20:32:35.8109409Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8109416Z 2025-05-07T20:32:35.8109515Z moe/activation_test.py:117: 2025-05-07T20:32:35.8109643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8109740Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8109839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8110335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8110430Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8110793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8111009Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8111346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8111440Z kernel = self.compile( 2025-05-07T20:32:35.8111826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8112004Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8112128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8112133Z 2025-05-07T20:32:35.8112341Z self = 2025-05-07T20:32:35.8113166Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8113670Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec0f64940>} 2025-05-07T20:32:35.8114448Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8114672Z context = 2025-05-07T20:32:35.8114677Z 2025-05-07T20:32:35.8114846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8115112Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8115220Z module_map=module_map) 2025-05-07T20:32:35.8115380Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8115474Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8115553Z E ^ 2025-05-07T20:32:35.8115901Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8115906Z 2025-05-07T20:32:35.8116320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8116325Z 2025-05-07T20:32:35.8116432Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8116647Z self=, 2025-05-07T20:32:35.8116727Z T=128, 2025-05-07T20:32:35.8116798Z D=7168, 2025-05-07T20:32:35.8116878Z scale_ub=None, 2025-05-07T20:32:35.8117005Z contiguous=True, 2025-05-07T20:32:35.8117086Z compiled=False, 2025-05-07T20:32:35.8117157Z ) 2025-05-07T20:32:35.8117369Z self = 2025-05-07T20:32:35.8117534Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8117539Z 2025-05-07T20:32:35.8117609Z @given( 2025-05-07T20:32:35.8117727Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8117821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8117942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8118056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8118170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8118246Z ) 2025-05-07T20:32:35.8118493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8118593Z def test_silu_mul_quant( 2025-05-07T20:32:35.8118668Z self, 2025-05-07T20:32:35.8118745Z T: int, 2025-05-07T20:32:35.8118818Z D: int, 2025-05-07T20:32:35.8118919Z scale_ub: Optional[float], 2025-05-07T20:32:35.8119006Z contiguous: bool, 2025-05-07T20:32:35.8119088Z compiled: bool, 2025-05-07T20:32:35.8119166Z ) -> None: 2025-05-07T20:32:35.8119257Z torch.manual_seed(2025) 2025-05-07T20:32:35.8119327Z 2025-05-07T20:32:35.8119491Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8119560Z 2025-05-07T20:32:35.8119656Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8119780Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8119870Z x = x_sign * x_clamp 2025-05-07T20:32:35.8119948Z x0 = x[:, :D] 2025-05-07T20:32:35.8120024Z x1 = x[:, D:] 2025-05-07T20:32:35.8120094Z 2025-05-07T20:32:35.8120181Z if contiguous: 2025-05-07T20:32:35.8120270Z x0 = x0.contiguous() 2025-05-07T20:32:35.8120404Z x1 = x1.contiguous() 2025-05-07T20:32:35.8120477Z 2025-05-07T20:32:35.8120566Z if scale_ub is not None: 2025-05-07T20:32:35.8120666Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8120800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8120876Z ) 2025-05-07T20:32:35.8120958Z else: 2025-05-07T20:32:35.8121049Z scale_ub_tensor = None 2025-05-07T20:32:35.8121117Z 2025-05-07T20:32:35.8121246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8121376Z op = silu_mul_quant 2025-05-07T20:32:35.8121459Z if compiled: 2025-05-07T20:32:35.8121603Z op = torch.compile(op) 2025-05-07T20:32:35.8121708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8121779Z 2025-05-07T20:32:35.8121869Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8121874Z 2025-05-07T20:32:35.8121968Z moe/activation_test.py:117: 2025-05-07T20:32:35.8122100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8122201Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8122295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8122786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8122879Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8123231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8123456Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8123796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8123888Z kernel = self.compile( 2025-05-07T20:32:35.8124309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8124485Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8124610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8124616Z 2025-05-07T20:32:35.8124820Z self = 2025-05-07T20:32:35.8125590Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8126093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec0f65240>} 2025-05-07T20:32:35.8126843Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8127035Z context = 2025-05-07T20:32:35.8127039Z 2025-05-07T20:32:35.8127202Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8127464Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8127567Z module_map=module_map) 2025-05-07T20:32:35.8127725Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8127826Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8127903Z E ^ 2025-05-07T20:32:35.8128254Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8128263Z 2025-05-07T20:32:35.8128669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8128717Z 2025-05-07T20:32:35.8128820Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8129041Z self=, 2025-05-07T20:32:35.8129120Z T=2048, 2025-05-07T20:32:35.8129190Z D=7168, 2025-05-07T20:32:35.8129270Z scale_ub=1200.0, 2025-05-07T20:32:35.8129350Z contiguous=True, 2025-05-07T20:32:35.8129430Z compiled=False, 2025-05-07T20:32:35.8129502Z ) 2025-05-07T20:32:35.8129713Z self = 2025-05-07T20:32:35.8129933Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8129938Z 2025-05-07T20:32:35.8130047Z @given( 2025-05-07T20:32:35.8130164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8130263Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8130376Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8130498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8130612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8130683Z ) 2025-05-07T20:32:35.8130923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8131016Z def test_silu_mul_quant( 2025-05-07T20:32:35.8131090Z self, 2025-05-07T20:32:35.8131166Z T: int, 2025-05-07T20:32:35.8131239Z D: int, 2025-05-07T20:32:35.8131336Z scale_ub: Optional[float], 2025-05-07T20:32:35.8131424Z contiguous: bool, 2025-05-07T20:32:35.8131509Z compiled: bool, 2025-05-07T20:32:35.8131585Z ) -> None: 2025-05-07T20:32:35.8131685Z torch.manual_seed(2025) 2025-05-07T20:32:35.8131753Z 2025-05-07T20:32:35.8131920Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8133724Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
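[annotation] The two failure modes interleave: even the T=1 example above dies in the Triton compiler, confirming the fp8e4nv error is architecture-dependent rather than size-dependent, while the OOMs depend only on how much memory earlier examples left behind. One plausible reason the memory never comes back is that each failed example's exception frames keep its tensors reachable. Because hypothesis runs every example inside a single test call, a pytest fixture would fire too late; a cleanup helper invoked in a try/finally around each example body (hypothetical, not present in the test file) would at least return cached blocks:

# Hypothetical per-example cleanup for use inside the test body.
import gc
import torch

def _release_cuda_memory() -> None:
    # gc.collect() drops tensors kept alive only by exception frames;
    # empty_cache() returns the allocator's cached blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()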
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8133734Z 2025-05-07T20:32:35.8133850Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8133857Z 2025-05-07T20:32:35.8133960Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8134179Z self=, 2025-05-07T20:32:35.8134256Z T=1, 2025-05-07T20:32:35.8134330Z D=5120, 2025-05-07T20:32:35.8134415Z scale_ub=1200.0, 2025-05-07T20:32:35.8134497Z contiguous=True, 2025-05-07T20:32:35.8134578Z compiled=False, 2025-05-07T20:32:35.8134648Z ) 2025-05-07T20:32:35.8134864Z self = 2025-05-07T20:32:35.8135026Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8135031Z 2025-05-07T20:32:35.8135102Z @given( 2025-05-07T20:32:35.8135221Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8135315Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8135429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8135541Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8135656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8135734Z ) 2025-05-07T20:32:35.8135981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8136072Z def test_silu_mul_quant( 2025-05-07T20:32:35.8136150Z self, 2025-05-07T20:32:35.8136225Z T: int, 2025-05-07T20:32:35.8136300Z D: int, 2025-05-07T20:32:35.8136445Z scale_ub: Optional[float], 2025-05-07T20:32:35.8136532Z contiguous: bool, 2025-05-07T20:32:35.8136615Z compiled: bool, 2025-05-07T20:32:35.8136692Z ) -> None: 2025-05-07T20:32:35.8136782Z torch.manual_seed(2025) 2025-05-07T20:32:35.8136854Z 2025-05-07T20:32:35.8137023Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8137095Z 2025-05-07T20:32:35.8137187Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8137307Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8137440Z x = x_sign * x_clamp 2025-05-07T20:32:35.8137517Z x0 = x[:, :D] 2025-05-07T20:32:35.8137655Z x1 = x[:, D:] 2025-05-07T20:32:35.8137731Z 2025-05-07T20:32:35.8137813Z if contiguous: 2025-05-07T20:32:35.8137904Z x0 = x0.contiguous() 2025-05-07T20:32:35.8137992Z x1 = x1.contiguous() 2025-05-07T20:32:35.8138065Z 2025-05-07T20:32:35.8138158Z if scale_ub is not None: 2025-05-07T20:32:35.8138270Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8138402Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8138475Z ) 2025-05-07T20:32:35.8138555Z else: 2025-05-07T20:32:35.8138647Z scale_ub_tensor = None 2025-05-07T20:32:35.8138714Z 2025-05-07T20:32:35.8138852Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8138939Z op = silu_mul_quant 2025-05-07T20:32:35.8139026Z if compiled: 2025-05-07T20:32:35.8139127Z op = torch.compile(op) 2025-05-07T20:32:35.8139231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8139307Z 2025-05-07T20:32:35.8139396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8139400Z 2025-05-07T20:32:35.8139495Z moe/activation_test.py:117: 2025-05-07T20:32:35.8139624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8139844Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8139967Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8140479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8140580Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8140941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8141159Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8141503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8141601Z kernel = self.compile( 2025-05-07T20:32:35.8141979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8142152Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8142279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8142284Z 2025-05-07T20:32:35.8142485Z self = 2025-05-07T20:32:35.8143259Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8143762Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec0f66200>} 2025-05-07T20:32:35.8144505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8144769Z context = 2025-05-07T20:32:35.8144774Z 2025-05-07T20:32:35.8144954Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8145265Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8145377Z module_map=module_map) 2025-05-07T20:32:35.8145543Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8145639Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8145714Z E ^ 2025-05-07T20:32:35.8146106Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8146147Z 2025-05-07T20:32:35.8146562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8146567Z 2025-05-07T20:32:35.8146670Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8146892Z self=, 2025-05-07T20:32:35.8146968Z T=2048, 2025-05-07T20:32:35.8147045Z D=5120, 2025-05-07T20:32:35.8147124Z scale_ub=None, 2025-05-07T20:32:35.8147205Z contiguous=True, 2025-05-07T20:32:35.8147287Z compiled=False, 2025-05-07T20:32:35.8147358Z ) 2025-05-07T20:32:35.8147569Z self = 2025-05-07T20:32:35.8147743Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8147751Z 2025-05-07T20:32:35.8147822Z @given( 2025-05-07T20:32:35.8147939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8148035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8148147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8148267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8148377Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8148449Z ) 2025-05-07T20:32:35.8148745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8148836Z def test_silu_mul_quant( 2025-05-07T20:32:35.8148909Z self, 2025-05-07T20:32:35.8148984Z T: int, 2025-05-07T20:32:35.8149059Z D: int, 2025-05-07T20:32:35.8149159Z scale_ub: Optional[float], 2025-05-07T20:32:35.8149244Z contiguous: bool, 2025-05-07T20:32:35.8149329Z compiled: bool, 2025-05-07T20:32:35.8149406Z ) -> None: 2025-05-07T20:32:35.8149500Z torch.manual_seed(2025) 2025-05-07T20:32:35.8149571Z 2025-05-07T20:32:35.8149745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8149817Z 2025-05-07T20:32:35.8149908Z > x_sign = torch.sign(x) 2025-05-07T20:32:35.8151697Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
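[annotation] The "Tried to allocate" figures match the test's bf16 input tensor exactly: `torch.randn([T, 2 * D], dtype=torch.bfloat16)` needs T * 2D elements at 2 bytes each. A quick check against the sizes logged in this run:

# Each failed allocation is exactly one [T, 2*D] bfloat16 tensor (2 bytes/elem).
def alloc_mib(T: int, D: int) -> float:
    return T * 2 * D * 2 / 2**20

assert alloc_mib(16384, 5120) == 320.0  # the 320.00 MiB failures
assert alloc_mib(16384, 7168) == 448.0  # the 448.00 MiB failure
assert alloc_mib(4096, 7168) == 112.0   # the 112.00 MiB failures
assert alloc_mib(2048, 5120) == 40.0    # the 40.00 MiB failures

So the requests themselves are modest; the failures come from the ~21.7 GiB already held by the process, not from any single oversized example.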
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8151705Z 2025-05-07T20:32:35.8151819Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:35.8151824Z 2025-05-07T20:32:35.8151932Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8152151Z self=, 2025-05-07T20:32:35.8152230Z T=16384, 2025-05-07T20:32:35.8152304Z D=5120, 2025-05-07T20:32:35.8152382Z scale_ub=None, 2025-05-07T20:32:35.8152467Z contiguous=True, 2025-05-07T20:32:35.8152547Z compiled=False, 2025-05-07T20:32:35.8152618Z ) 2025-05-07T20:32:35.8152878Z self = 2025-05-07T20:32:35.8153049Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8153053Z 2025-05-07T20:32:35.8153123Z @given( 2025-05-07T20:32:35.8153244Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8153344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8153460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8153577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8153731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8153806Z ) 2025-05-07T20:32:35.8154089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8154179Z def test_silu_mul_quant( 2025-05-07T20:32:35.8154255Z self, 2025-05-07T20:32:35.8154329Z T: int, 2025-05-07T20:32:35.8154401Z D: int, 2025-05-07T20:32:35.8154508Z scale_ub: Optional[float], 2025-05-07T20:32:35.8154600Z contiguous: bool, 2025-05-07T20:32:35.8154683Z compiled: bool, 2025-05-07T20:32:35.8154766Z ) -> None: 2025-05-07T20:32:35.8154862Z torch.manual_seed(2025) 2025-05-07T20:32:35.8154931Z 2025-05-07T20:32:35.8155099Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8156894Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8156908Z 2025-05-07T20:32:35.8157066Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8157072Z 2025-05-07T20:32:35.8157173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8157391Z self=, 2025-05-07T20:32:35.8157462Z T=4096, 2025-05-07T20:32:35.8157535Z D=5120, 2025-05-07T20:32:35.8157625Z scale_ub=None, 2025-05-07T20:32:35.8157706Z contiguous=True, 2025-05-07T20:32:35.8157784Z compiled=False, 2025-05-07T20:32:35.8157854Z ) 2025-05-07T20:32:35.8158064Z self = 2025-05-07T20:32:35.8158238Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8158249Z 2025-05-07T20:32:35.8158321Z @given( 2025-05-07T20:32:35.8158432Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8158529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8158640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8158758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8158870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8158940Z ) 2025-05-07T20:32:35.8159180Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8159274Z def test_silu_mul_quant( 2025-05-07T20:32:35.8159347Z self, 2025-05-07T20:32:35.8159418Z T: int, 2025-05-07T20:32:35.8159491Z D: int, 2025-05-07T20:32:35.8159586Z scale_ub: Optional[float], 2025-05-07T20:32:35.8159678Z contiguous: bool, 2025-05-07T20:32:35.8159760Z compiled: bool, 2025-05-07T20:32:35.8159833Z ) -> None: 2025-05-07T20:32:35.8159929Z torch.manual_seed(2025) 2025-05-07T20:32:35.8159999Z 2025-05-07T20:32:35.8160165Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8161923Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8161973Z 2025-05-07T20:32:35.8162090Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8162133Z 2025-05-07T20:32:35.8162237Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8162490Z self=, 2025-05-07T20:32:35.8162561Z T=2048, 2025-05-07T20:32:35.8162637Z D=5120, 2025-05-07T20:32:35.8162717Z scale_ub=None, 2025-05-07T20:32:35.8162801Z contiguous=False, 2025-05-07T20:32:35.8162883Z compiled=False, 2025-05-07T20:32:35.8162954Z ) 2025-05-07T20:32:35.8163166Z self = 2025-05-07T20:32:35.8163335Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.8163339Z 2025-05-07T20:32:35.8163408Z @given( 2025-05-07T20:32:35.8163526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8163618Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8163732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8163851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8163961Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8164037Z ) 2025-05-07T20:32:35.8164275Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8164367Z def test_silu_mul_quant( 2025-05-07T20:32:35.8164441Z self, 2025-05-07T20:32:35.8164511Z T: int, 2025-05-07T20:32:35.8164586Z D: int, 2025-05-07T20:32:35.8164725Z scale_ub: Optional[float], 2025-05-07T20:32:35.8164815Z contiguous: bool, 2025-05-07T20:32:35.8164899Z compiled: bool, 2025-05-07T20:32:35.8164973Z ) -> None: 2025-05-07T20:32:35.8165066Z torch.manual_seed(2025) 2025-05-07T20:32:35.8165136Z 2025-05-07T20:32:35.8165300Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8167088Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8167103Z 2025-05-07T20:32:35.8167218Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8167222Z 2025-05-07T20:32:35.8167321Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8167540Z self=, 2025-05-07T20:32:35.8167609Z T=4096, 2025-05-07T20:32:35.8167679Z D=7168, 2025-05-07T20:32:35.8167760Z scale_ub=None, 2025-05-07T20:32:35.8167840Z contiguous=True, 2025-05-07T20:32:35.8167917Z compiled=True, 2025-05-07T20:32:35.8167993Z ) 2025-05-07T20:32:35.8168206Z self = 2025-05-07T20:32:35.8168374Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.8168382Z 2025-05-07T20:32:35.8168457Z @given( 2025-05-07T20:32:35.8168569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8168666Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8168826Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8168942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8169057Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8169126Z ) 2025-05-07T20:32:35.8169368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8169462Z def test_silu_mul_quant( 2025-05-07T20:32:35.8169537Z self, 2025-05-07T20:32:35.8169608Z T: int, 2025-05-07T20:32:35.8169683Z D: int, 2025-05-07T20:32:35.8169845Z scale_ub: Optional[float], 2025-05-07T20:32:35.8169932Z contiguous: bool, 2025-05-07T20:32:35.8170053Z compiled: bool, 2025-05-07T20:32:35.8170129Z ) -> None: 2025-05-07T20:32:35.8170224Z torch.manual_seed(2025) 2025-05-07T20:32:35.8170292Z 2025-05-07T20:32:35.8170454Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8172259Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8172268Z 2025-05-07T20:32:35.8172381Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8172386Z 2025-05-07T20:32:35.8172489Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8172704Z self=, 2025-05-07T20:32:35.8172775Z T=2048, 2025-05-07T20:32:35.8172847Z D=5120, 2025-05-07T20:32:35.8172925Z scale_ub=1200.0, 2025-05-07T20:32:35.8173050Z contiguous=False, 2025-05-07T20:32:35.8173130Z compiled=False, 2025-05-07T20:32:35.8173199Z ) 2025-05-07T20:32:35.8173412Z self = 2025-05-07T20:32:35.8173584Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.8173588Z 2025-05-07T20:32:35.8173658Z @given( 2025-05-07T20:32:35.8173773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8173865Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8173981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8174097Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8174212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8174285Z ) 2025-05-07T20:32:35.8174529Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8174618Z def test_silu_mul_quant( 2025-05-07T20:32:35.8174696Z self, 2025-05-07T20:32:35.8174772Z T: int, 2025-05-07T20:32:35.8174844Z D: int, 2025-05-07T20:32:35.8174942Z scale_ub: Optional[float], 2025-05-07T20:32:35.8175028Z contiguous: bool, 2025-05-07T20:32:35.8175110Z compiled: bool, 2025-05-07T20:32:35.8175187Z ) -> None: 2025-05-07T20:32:35.8175279Z torch.manual_seed(2025) 2025-05-07T20:32:35.8175346Z 2025-05-07T20:32:35.8175512Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8177300Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8177357Z 2025-05-07T20:32:35.8177475Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8177480Z 2025-05-07T20:32:35.8177578Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8177798Z self=, 2025-05-07T20:32:35.8177868Z T=4096, 2025-05-07T20:32:35.8177941Z D=7168, 2025-05-07T20:32:35.8178022Z scale_ub=1200.0, 2025-05-07T20:32:35.8178149Z contiguous=True, 2025-05-07T20:32:35.8178227Z compiled=False, 2025-05-07T20:32:35.8178298Z ) 2025-05-07T20:32:35.8178544Z self = 2025-05-07T20:32:35.8178714Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8178722Z 2025-05-07T20:32:35.8178798Z @given( 2025-05-07T20:32:35.8178912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8179019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8179131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8179244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8179359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8179429Z ) 2025-05-07T20:32:35.8179670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8179815Z def test_silu_mul_quant( 2025-05-07T20:32:35.8179906Z self, 2025-05-07T20:32:35.8179993Z T: int, 2025-05-07T20:32:35.8180069Z D: int, 2025-05-07T20:32:35.8180168Z scale_ub: Optional[float], 2025-05-07T20:32:35.8180265Z contiguous: bool, 2025-05-07T20:32:35.8180347Z compiled: bool, 2025-05-07T20:32:35.8180420Z ) -> None: 2025-05-07T20:32:35.8180516Z torch.manual_seed(2025) 2025-05-07T20:32:35.8180588Z 2025-05-07T20:32:35.8180803Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8182605Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8182614Z 2025-05-07T20:32:35.8182734Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8182739Z 2025-05-07T20:32:35.8182842Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8183060Z self=, 2025-05-07T20:32:35.8183138Z T=16384, 2025-05-07T20:32:35.8183221Z D=7168, 2025-05-07T20:32:35.8183306Z scale_ub=None, 2025-05-07T20:32:35.8183394Z contiguous=False, 2025-05-07T20:32:35.8183472Z compiled=True, 2025-05-07T20:32:35.8183542Z ) 2025-05-07T20:32:35.8183755Z self = 2025-05-07T20:32:35.8183928Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.8183933Z 2025-05-07T20:32:35.8184003Z @given( 2025-05-07T20:32:35.8184117Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8184214Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8184328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8184455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8184568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8184642Z ) 2025-05-07T20:32:35.8184883Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8185025Z def test_silu_mul_quant( 2025-05-07T20:32:35.8185108Z self, 2025-05-07T20:32:35.8185181Z T: int, 2025-05-07T20:32:35.8185253Z D: int, 2025-05-07T20:32:35.8185352Z scale_ub: Optional[float], 2025-05-07T20:32:35.8185438Z contiguous: bool, 2025-05-07T20:32:35.8185520Z compiled: bool, 2025-05-07T20:32:35.8185597Z ) -> None: 2025-05-07T20:32:35.8185688Z torch.manual_seed(2025) 2025-05-07T20:32:35.8185757Z 2025-05-07T20:32:35.8185924Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8187804Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
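[editor note] The requested sizes in these failures follow directly from the example parameters: the test materializes a [T, 2 * D] bfloat16 tensor, i.e. T * 2D * 2 bytes. Checking the 448 MiB request just above (plain arithmetic, no GPU required):

T, D = 16384, 7168
bytes_needed = T * (2 * D) * 2   # bfloat16 = 2 bytes per element
print(bytes_needed / 2**20)      # 448.0 -> matches "Tried to allocate 448.00 MiB"

So no single tensor here is large; the OOMs come from the ~21.7 GiB the process is already holding when each new example starts.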
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8187821Z 2025-05-07T20:32:35.8187940Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8187945Z 2025-05-07T20:32:35.8188044Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8188261Z self=, 2025-05-07T20:32:35.8188332Z T=4096, 2025-05-07T20:32:35.8188401Z D=7168, 2025-05-07T20:32:35.8188488Z scale_ub=None, 2025-05-07T20:32:35.8188568Z contiguous=True, 2025-05-07T20:32:35.8188648Z compiled=False, 2025-05-07T20:32:35.8188724Z ) 2025-05-07T20:32:35.8188933Z self = 2025-05-07T20:32:35.8189101Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8189108Z 2025-05-07T20:32:35.8189184Z @given( 2025-05-07T20:32:35.8189336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8189436Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8189550Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8189662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8189773Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8190042Z ) 2025-05-07T20:32:35.8190312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8190413Z def test_silu_mul_quant( 2025-05-07T20:32:35.8190488Z self, 2025-05-07T20:32:35.8190567Z T: int, 2025-05-07T20:32:35.8190642Z D: int, 2025-05-07T20:32:35.8190741Z scale_ub: Optional[float], 2025-05-07T20:32:35.8190833Z contiguous: bool, 2025-05-07T20:32:35.8190917Z compiled: bool, 2025-05-07T20:32:35.8190992Z ) -> None: 2025-05-07T20:32:35.8191087Z torch.manual_seed(2025) 2025-05-07T20:32:35.8191163Z 2025-05-07T20:32:35.8191332Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8193105Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8193113Z 2025-05-07T20:32:35.8193229Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8193234Z 2025-05-07T20:32:35.8193338Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8193555Z self=, 2025-05-07T20:32:35.8193723Z T=16384, 2025-05-07T20:32:35.8193798Z D=7168, 2025-05-07T20:32:35.8193879Z scale_ub=None, 2025-05-07T20:32:35.8193966Z contiguous=True, 2025-05-07T20:32:35.8194049Z compiled=False, 2025-05-07T20:32:35.8194120Z ) 2025-05-07T20:32:35.8194338Z self = 2025-05-07T20:32:35.8194511Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8194515Z 2025-05-07T20:32:35.8194591Z @given( 2025-05-07T20:32:35.8194774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8194871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8195038Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8195158Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8195272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8195348Z ) 2025-05-07T20:32:35.8195602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8195694Z def test_silu_mul_quant( 2025-05-07T20:32:35.8195775Z self, 2025-05-07T20:32:35.8195852Z T: int, 2025-05-07T20:32:35.8195929Z D: int, 2025-05-07T20:32:35.8196031Z scale_ub: Optional[float], 2025-05-07T20:32:35.8196118Z contiguous: bool, 2025-05-07T20:32:35.8196203Z compiled: bool, 2025-05-07T20:32:35.8196286Z ) -> None: 2025-05-07T20:32:35.8196376Z torch.manual_seed(2025) 2025-05-07T20:32:35.8196451Z 2025-05-07T20:32:35.8196619Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8198442Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8198455Z 2025-05-07T20:32:35.8198574Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8198578Z 2025-05-07T20:32:35.8198682Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8198906Z self=, 2025-05-07T20:32:35.8198987Z T=16384, 2025-05-07T20:32:35.8199062Z D=7168, 2025-05-07T20:32:35.8199146Z scale_ub=1200.0, 2025-05-07T20:32:35.8199232Z contiguous=True, 2025-05-07T20:32:35.8199315Z compiled=False, 2025-05-07T20:32:35.8199389Z ) 2025-05-07T20:32:35.8199604Z self = 2025-05-07T20:32:35.8202624Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8202640Z 2025-05-07T20:32:35.8202725Z @given( 2025-05-07T20:32:35.8202848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8202954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8203071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8203193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8203308Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8203385Z ) 2025-05-07T20:32:35.8203640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8203739Z def test_silu_mul_quant( 2025-05-07T20:32:35.8203819Z self, 2025-05-07T20:32:35.8203901Z T: int, 2025-05-07T20:32:35.8203978Z D: int, 2025-05-07T20:32:35.8204079Z scale_ub: Optional[float], 2025-05-07T20:32:35.8204175Z contiguous: bool, 2025-05-07T20:32:35.8204261Z compiled: bool, 2025-05-07T20:32:35.8204433Z ) -> None: 2025-05-07T20:32:35.8204534Z torch.manual_seed(2025) 2025-05-07T20:32:35.8204608Z 2025-05-07T20:32:35.8204786Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8206607Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
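[editor note] Every one of these OOMs reports roughly the same baseline (~21.7 GiB already allocated by PyTorch), which points at memory accumulating across Hypothesis examples rather than at any one oversized allocation. One possible mitigation, sketched below: Hypothesis invokes the decorated test body once per example (setUp/tearDown only wrap the whole run), so a reset at the top of the body runs between examples. _reset_cuda_memory is a hypothetical helper, not something the test currently defines:

import gc

import torch

def _reset_cuda_memory() -> None:
    # Drop dead Python references, then return cached blocks to the driver so
    # the next Hypothesis example starts from a clean allocator state.
    gc.collect()
    torch.cuda.empty_cache()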
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8206650Z 2025-05-07T20:32:35.8206776Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8206781Z 2025-05-07T20:32:35.8206885Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8207114Z self=, 2025-05-07T20:32:35.8207195Z T=128, 2025-05-07T20:32:35.8207272Z D=5120, 2025-05-07T20:32:35.8207357Z scale_ub=1200.0, 2025-05-07T20:32:35.8207449Z contiguous=False, 2025-05-07T20:32:35.8207532Z compiled=False, 2025-05-07T20:32:35.8207607Z ) 2025-05-07T20:32:35.8207827Z self = 2025-05-07T20:32:35.8208001Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.8208008Z 2025-05-07T20:32:35.8208087Z @given( 2025-05-07T20:32:35.8208208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8208308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8208433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8208556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8208672Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8208793Z ) 2025-05-07T20:32:35.8209046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8209145Z def test_silu_mul_quant( 2025-05-07T20:32:35.8209223Z self, 2025-05-07T20:32:35.8209303Z T: int, 2025-05-07T20:32:35.8209386Z D: int, 2025-05-07T20:32:35.8209487Z scale_ub: Optional[float], 2025-05-07T20:32:35.8209579Z contiguous: bool, 2025-05-07T20:32:35.8209668Z compiled: bool, 2025-05-07T20:32:35.8209747Z ) -> None: 2025-05-07T20:32:35.8209847Z torch.manual_seed(2025) 2025-05-07T20:32:35.8209923Z 2025-05-07T20:32:35.8210095Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8210169Z 2025-05-07T20:32:35.8210272Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8210400Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8210496Z x = x_sign * x_clamp 2025-05-07T20:32:35.8210580Z x0 = x[:, :D] 2025-05-07T20:32:35.8210663Z x1 = x[:, D:] 2025-05-07T20:32:35.8210742Z 2025-05-07T20:32:35.8210830Z if contiguous: 2025-05-07T20:32:35.8210925Z x0 = x0.contiguous() 2025-05-07T20:32:35.8211020Z x1 = x1.contiguous() 2025-05-07T20:32:35.8211092Z 2025-05-07T20:32:35.8211185Z if scale_ub is not None: 2025-05-07T20:32:35.8211300Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8211438Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8211516Z ) 2025-05-07T20:32:35.8211605Z else: 2025-05-07T20:32:35.8211701Z scale_ub_tensor = None 2025-05-07T20:32:35.8211776Z 2025-05-07T20:32:35.8211914Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8212003Z op = silu_mul_quant 2025-05-07T20:32:35.8212093Z if compiled: 2025-05-07T20:32:35.8212197Z op = torch.compile(op) 2025-05-07T20:32:35.8212352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8212427Z 2025-05-07T20:32:35.8212523Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8212529Z 2025-05-07T20:32:35.8212627Z moe/activation_test.py:117: 2025-05-07T20:32:35.8212762Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8212869Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8212972Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8213480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8213621Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8214023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8214254Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8214600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8214703Z kernel = self.compile( 2025-05-07T20:32:35.8215094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8215276Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8215405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8215410Z 2025-05-07T20:32:35.8215617Z self = 2025-05-07T20:32:35.8216409Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8216963Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1229ea0>} 2025-05-07T20:32:35.8217731Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8217925Z context = 2025-05-07T20:32:35.8217930Z 2025-05-07T20:32:35.8218100Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8218374Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8218486Z module_map=module_map) 2025-05-07T20:32:35.8218655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8218755Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8218833Z E ^ 2025-05-07T20:32:35.8219197Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8219204Z 2025-05-07T20:32:35.8219617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8219622Z 2025-05-07T20:32:35.8219732Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8220064Z self=, 2025-05-07T20:32:35.8220145Z T=2048, 2025-05-07T20:32:35.8220229Z D=7168, 2025-05-07T20:32:35.8220312Z scale_ub=None, 2025-05-07T20:32:35.8220404Z contiguous=False, 2025-05-07T20:32:35.8220494Z compiled=False, 2025-05-07T20:32:35.8220566Z ) 2025-05-07T20:32:35.8220788Z self = 2025-05-07T20:32:35.8220969Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.8220974Z 2025-05-07T20:32:35.8221051Z @given( 2025-05-07T20:32:35.8221226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8221331Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8221452Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8221572Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8221690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8221764Z ) 2025-05-07T20:32:35.8222021Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8222117Z def test_silu_mul_quant( 2025-05-07T20:32:35.8222237Z self, 2025-05-07T20:32:35.8222319Z T: int, 2025-05-07T20:32:35.8222401Z D: int, 2025-05-07T20:32:35.8222549Z scale_ub: Optional[float], 2025-05-07T20:32:35.8222644Z contiguous: bool, 2025-05-07T20:32:35.8222731Z compiled: bool, 2025-05-07T20:32:35.8222814Z ) -> None: 2025-05-07T20:32:35.8222910Z torch.manual_seed(2025) 2025-05-07T20:32:35.8222990Z 2025-05-07T20:32:35.8223171Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8224971Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
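[editor note] Alongside the OOMs, the run hits a second, independent failure mode: the Triton CompilationError above. fp8e4nv corresponds to torch.float8_e4m3fn, which Triton at this version only supports on NVIDIA parts with compute capability 8.9 or newer (Ada/Hopper); the A10G on this g5 runner reports (8, 6), so only fp8e5 and fp8e4b15 are available, exactly as the ValueError says. A sketch of one possible capability guard (supports_fp8e4nv is a hypothetical helper, not the repo's existing gating):

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv lowering requires compute capability >= (8, 9).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the test, this would skip cleanly instead of erroring:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")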
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8224979Z 2025-05-07T20:32:35.8225105Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8225110Z 2025-05-07T20:32:35.8225214Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8225440Z self=, 2025-05-07T20:32:35.8225527Z T=128, 2025-05-07T20:32:35.8225641Z D=7168, 2025-05-07T20:32:35.8225726Z scale_ub=1200.0, 2025-05-07T20:32:35.8225815Z contiguous=True, 2025-05-07T20:32:35.8225907Z compiled=True, 2025-05-07T20:32:35.8225980Z ) 2025-05-07T20:32:35.8226200Z self = 2025-05-07T20:32:35.8226373Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.8226378Z 2025-05-07T20:32:35.8226454Z @given( 2025-05-07T20:32:35.8226575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8226682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8226802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8226922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8227037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8227111Z ) 2025-05-07T20:32:35.8227365Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8227462Z def test_silu_mul_quant( 2025-05-07T20:32:35.8227539Z self, 2025-05-07T20:32:35.8227619Z T: int, 2025-05-07T20:32:35.8227695Z D: int, 2025-05-07T20:32:35.8227799Z scale_ub: Optional[float], 2025-05-07T20:32:35.8227890Z contiguous: bool, 2025-05-07T20:32:35.8227980Z compiled: bool, 2025-05-07T20:32:35.8228060Z ) -> None: 2025-05-07T20:32:35.8228156Z torch.manual_seed(2025) 2025-05-07T20:32:35.8228228Z 2025-05-07T20:32:35.8228403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8228480Z 2025-05-07T20:32:35.8228576Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8228705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8228795Z x = x_sign * x_clamp 2025-05-07T20:32:35.8228875Z x0 = x[:, :D] 2025-05-07T20:32:35.8228959Z x1 = x[:, D:] 2025-05-07T20:32:35.8229081Z 2025-05-07T20:32:35.8229167Z if contiguous: 2025-05-07T20:32:35.8229263Z x0 = x0.contiguous() 2025-05-07T20:32:35.8229352Z x1 = x1.contiguous() 2025-05-07T20:32:35.8229429Z 2025-05-07T20:32:35.8229521Z if scale_ub is not None: 2025-05-07T20:32:35.8229626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8229766Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8229842Z ) 2025-05-07T20:32:35.8229921Z else: 2025-05-07T20:32:35.8230017Z scale_ub_tensor = None 2025-05-07T20:32:35.8230134Z 2025-05-07T20:32:35.8230268Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8230398Z op = silu_mul_quant 2025-05-07T20:32:35.8230486Z if compiled: 2025-05-07T20:32:35.8230587Z op = torch.compile(op) 2025-05-07T20:32:35.8230698Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8230772Z 2025-05-07T20:32:35.8230870Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8230877Z 2025-05-07T20:32:35.8230976Z moe/activation_test.py:117: 2025-05-07T20:32:35.8231105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8231213Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8231316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8231699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.8231796Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.8232310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8232418Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8232787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8233017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8233435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8233533Z kernel = self.compile( 2025-05-07T20:32:35.8233921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8234100Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8234230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8234237Z 2025-05-07T20:32:35.8234458Z self = 2025-05-07T20:32:35.8235260Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8235789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec122b7f0>} 2025-05-07T20:32:35.8236549Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8236745Z context = 2025-05-07T20:32:35.8236749Z 2025-05-07T20:32:35.8236920Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8237193Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8237305Z module_map=module_map) 2025-05-07T20:32:35.8237468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8237567Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8237690Z E ^ 2025-05-07T20:32:35.8238054Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8238059Z 2025-05-07T20:32:35.8238484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8238492Z 2025-05-07T20:32:35.8238600Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8238827Z self=, 2025-05-07T20:32:35.8238908Z T=128, 2025-05-07T20:32:35.8239027Z D=7168, 2025-05-07T20:32:35.8239113Z scale_ub=1200.0, 2025-05-07T20:32:35.8239203Z contiguous=True, 2025-05-07T20:32:35.8239324Z compiled=False, 2025-05-07T20:32:35.8239399Z ) 2025-05-07T20:32:35.8239623Z self = 2025-05-07T20:32:35.8239801Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8239808Z 2025-05-07T20:32:35.8239888Z @given( 2025-05-07T20:32:35.8240013Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8240115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8240234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8240353Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8240469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8240545Z ) 2025-05-07T20:32:35.8240796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8240895Z def test_silu_mul_quant( 2025-05-07T20:32:35.8240976Z self, 2025-05-07T20:32:35.8241053Z T: int, 2025-05-07T20:32:35.8241133Z D: int, 2025-05-07T20:32:35.8241239Z scale_ub: Optional[float], 2025-05-07T20:32:35.8241330Z contiguous: bool, 2025-05-07T20:32:35.8241419Z compiled: bool, 2025-05-07T20:32:35.8241497Z ) -> None: 2025-05-07T20:32:35.8241596Z torch.manual_seed(2025) 2025-05-07T20:32:35.8241712Z 2025-05-07T20:32:35.8241886Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8241962Z 2025-05-07T20:32:35.8242061Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8242191Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8243999Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8244010Z 2025-05-07T20:32:35.8244133Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.8244140Z 2025-05-07T20:32:35.8244246Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8244477Z self=, 2025-05-07T20:32:35.8244554Z T=128, 2025-05-07T20:32:35.8244635Z D=5120, 2025-05-07T20:32:35.8244719Z scale_ub=1200.0, 2025-05-07T20:32:35.8244807Z contiguous=True, 2025-05-07T20:32:35.8244896Z compiled=True, 2025-05-07T20:32:35.8244970Z ) 2025-05-07T20:32:35.8245188Z self = 2025-05-07T20:32:35.8245369Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.8245374Z 2025-05-07T20:32:35.8245451Z @given( 2025-05-07T20:32:35.8245570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8245672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8245788Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8245958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8246077Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8246152Z ) 2025-05-07T20:32:35.8246403Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8246499Z def test_silu_mul_quant( 2025-05-07T20:32:35.8246577Z self, 2025-05-07T20:32:35.8246659Z T: int, 2025-05-07T20:32:35.8246735Z D: int, 2025-05-07T20:32:35.8246836Z scale_ub: Optional[float], 2025-05-07T20:32:35.8246974Z contiguous: bool, 2025-05-07T20:32:35.8247061Z compiled: bool, 2025-05-07T20:32:35.8247140Z ) -> None: 2025-05-07T20:32:35.8247279Z torch.manual_seed(2025) 2025-05-07T20:32:35.8247356Z 2025-05-07T20:32:35.8247527Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8247605Z 2025-05-07T20:32:35.8247696Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8247831Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8249625Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
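[editor note] Note how the failure point has drifted: earlier examples died allocating x at moe/activation_test.py:92 with ~26 MiB free, while these die one statement later at the torch.clamp on line 95 with ~4 MiB free, which again suggests memory leaking across examples. When triaging, a snapshot of allocator state between examples makes this visible; a minimal sketch using documented torch.cuda APIs:

import torch

free, total = torch.cuda.mem_get_info()  # free/total bytes on the current device
print(f"free={free / 2**20:.1f} MiB of total={total / 2**30:.2f} GiB")
print(torch.cuda.memory_summary(abbreviated=True))  # allocated vs. reserved breakdown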
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8249635Z 2025-05-07T20:32:35.8249765Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.8249770Z 2025-05-07T20:32:35.8249872Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8250099Z self=, 2025-05-07T20:32:35.8250177Z T=128, 2025-05-07T20:32:35.8250257Z D=7168, 2025-05-07T20:32:35.8250380Z scale_ub=None, 2025-05-07T20:32:35.8250472Z contiguous=True, 2025-05-07T20:32:35.8250557Z compiled=True, 2025-05-07T20:32:35.8250631Z ) 2025-05-07T20:32:35.8250851Z self = 2025-05-07T20:32:35.8251019Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.8251024Z 2025-05-07T20:32:35.8251103Z @given( 2025-05-07T20:32:35.8251222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8251324Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8251447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8251568Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8251684Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8251761Z ) 2025-05-07T20:32:35.8252014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8252119Z def test_silu_mul_quant( 2025-05-07T20:32:35.8252197Z self, 2025-05-07T20:32:35.8252280Z T: int, 2025-05-07T20:32:35.8252360Z D: int, 2025-05-07T20:32:35.8252462Z scale_ub: Optional[float], 2025-05-07T20:32:35.8252553Z contiguous: bool, 2025-05-07T20:32:35.8252642Z compiled: bool, 2025-05-07T20:32:35.8252721Z ) -> None: 2025-05-07T20:32:35.8252820Z torch.manual_seed(2025) 2025-05-07T20:32:35.8252895Z 2025-05-07T20:32:35.8253066Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8254879Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8254929Z 2025-05-07T20:32:35.8255053Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8255190Z =============================== warnings summary =============================== 2025-05-07T20:32:35.8255510Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.8255822Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.8256210Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.8257104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:35.8257343Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:35.8257348Z 2025-05-07T20:32:35.8257565Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:35.8257735Z ================= 1 failed, 1 deselected, 3 warnings in 17.45s ================= 2025-05-07T20:32:37.3693197Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:37.4309890Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:37.4310136Z 2025-05-07T20:32:39.4328078Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:41.5837370Z ============================= test session starts ============================== 2025-05-07T20:32:41.5838029Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:41.5838559Z cachedir: .pytest_cache 2025-05-07T20:32:41.5839145Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:41.5839874Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:41.5840291Z plugins: hypothesis-6.131.14 2025-05-07T20:32:43.1734063Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:43.3491281Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:43.3491695Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:43.3491912Z 2025-05-07T20:32:45.8639614Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.8640433Z self=, 2025-05-07T20:32:45.8640858Z T=1, 2025-05-07T20:32:45.8641047Z D=5120, 2025-05-07T20:32:45.8641249Z scale_ub=None, 2025-05-07T20:32:45.8641474Z contiguous=True, 2025-05-07T20:32:45.8641699Z compiled=True, 2025-05-07T20:32:45.8641916Z ) 2025-05-07T20:32:45.8642250Z self = 2025-05-07T20:32:45.8642742Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.8643019Z 2025-05-07T20:32:45.8643097Z @given( 2025-05-07T20:32:45.8643334Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.8643662Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.8643970Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.8644314Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.8644651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.8645241Z ) 2025-05-07T20:32:45.8645606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.8646057Z def test_silu_mul_quant( 2025-05-07T20:32:45.8646297Z self, 2025-05-07T20:32:45.8646499Z T: int, 2025-05-07T20:32:45.8646700Z D: int, 2025-05-07T20:32:45.8646922Z scale_ub: Optional[float], 2025-05-07T20:32:45.8647201Z contiguous: bool, 2025-05-07T20:32:45.8647449Z compiled: bool, 2025-05-07T20:32:45.8647677Z ) -> None: 2025-05-07T20:32:45.8647997Z torch.manual_seed(2025) 2025-05-07T20:32:45.8648245Z 2025-05-07T20:32:45.8648609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.8648956Z 2025-05-07T20:32:45.8649158Z x_sign = torch.sign(x) 2025-05-07T20:32:45.8649460Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:45.8649772Z x = x_sign * x_clamp 2025-05-07T20:32:45.8650025Z x0 = x[:, :D] 2025-05-07T20:32:45.8650253Z x1 = x[:, D:] 2025-05-07T20:32:45.8650462Z 2025-05-07T20:32:45.8650655Z if contiguous: 2025-05-07T20:32:45.8650902Z x0 = x0.contiguous() 2025-05-07T20:32:45.8651160Z x1 = x1.contiguous() 2025-05-07T20:32:45.8651407Z 2025-05-07T20:32:45.8651606Z if scale_ub is not None: 2025-05-07T20:32:45.8651881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.8652225Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.8652542Z ) 2025-05-07T20:32:45.8652738Z else: 2025-05-07T20:32:45.8652958Z scale_ub_tensor = None 2025-05-07T20:32:45.8653218Z 2025-05-07T20:32:45.8653452Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.8653770Z op = silu_mul_quant 2025-05-07T20:32:45.8654031Z if compiled: 2025-05-07T20:32:45.8654286Z op = torch.compile(op) 2025-05-07T20:32:45.8654672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.8654951Z 2025-05-07T20:32:45.8655151Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.8655435Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.8655730Z 2025-05-07T20:32:45.8655974Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.8656306Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.8656603Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.8656927Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.8657292Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.8657607Z 2025-05-07T20:32:45.8657819Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:45.8658017Z 2025-05-07T20:32:45.8658126Z moe/activation_test.py:126: 2025-05-07T20:32:45.8658423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.8658766Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.8659104Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.8660078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.8661185Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.8661979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.8662918Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.8663626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.8664367Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.8665136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:45.8665965Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.8666702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.8667356Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.8667970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.8668491Z fn() 2025-05-07T20:32:45.8669056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.8669690Z self.fn.run( 
2025-05-07T20:32:45.8670185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.8670723Z kernel = self.compile( 2025-05-07T20:32:45.8671280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.8671952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.8672348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.8672588Z 2025-05-07T20:32:45.8672800Z self = 2025-05-07T20:32:45.8673910Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.8675331Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32572a4af0>} 2025-05-07T20:32:45.8676768Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.8677819Z context = 2025-05-07T20:32:45.8678118Z 2025-05-07T20:32:45.8678288Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.8678822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.8679313Z module_map=module_map) 2025-05-07T20:32:45.8679683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.8680051Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.8680321Z E ^ 2025-05-07T20:32:45.8680792Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.8681257Z 2025-05-07T20:32:45.8681682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.8682213Z 2025-05-07T20:32:45.8682319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.8682740Z self=, 2025-05-07T20:32:45.8683140Z T=2048, 2025-05-07T20:32:45.8683333Z D=5120, 2025-05-07T20:32:45.8683529Z scale_ub=1200.0, 2025-05-07T20:32:45.8683753Z contiguous=True, 2025-05-07T20:32:45.8683982Z compiled=False, 2025-05-07T20:32:45.8684193Z ) 2025-05-07T20:32:47.2038163Z self = 2025-05-07T20:32:47.2038788Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.2039067Z 2025-05-07T20:32:47.2039163Z @given( 2025-05-07T20:32:47.2039412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.2039733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.2040044Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.2040606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.2040933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.2041218Z ) 2025-05-07T20:32:47.2041577Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.2042024Z def test_silu_mul_quant( 2025-05-07T20:32:47.2042271Z self, 2025-05-07T20:32:47.2042475Z T: int, 2025-05-07T20:32:47.2042671Z D: int, 2025-05-07T20:32:47.2042892Z scale_ub: Optional[float], 2025-05-07T20:32:47.2043258Z contiguous: bool, 2025-05-07T20:32:47.2043496Z compiled: bool, 2025-05-07T20:32:47.2043727Z ) -> None: 2025-05-07T20:32:47.2044018Z torch.manual_seed(2025) 2025-05-07T20:32:47.2044252Z 2025-05-07T20:32:47.2044530Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.2044872Z 
2025-05-07T20:32:47.2045068Z x_sign = torch.sign(x) 2025-05-07T20:32:47.2045362Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.2045674Z x = x_sign * x_clamp 2025-05-07T20:32:47.2045918Z x0 = x[:, :D] 2025-05-07T20:32:47.2046129Z x1 = x[:, D:] 2025-05-07T20:32:47.2046338Z 2025-05-07T20:32:47.2046526Z if contiguous: 2025-05-07T20:32:47.2046759Z x0 = x0.contiguous() 2025-05-07T20:32:47.2047017Z x1 = x1.contiguous() 2025-05-07T20:32:47.2047257Z 2025-05-07T20:32:47.2047444Z if scale_ub is not None: 2025-05-07T20:32:47.2047718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.2048058Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.2048358Z ) 2025-05-07T20:32:47.2048555Z else: 2025-05-07T20:32:47.2048769Z scale_ub_tensor = None 2025-05-07T20:32:47.2049014Z 2025-05-07T20:32:47.2049244Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2049562Z op = silu_mul_quant 2025-05-07T20:32:47.2049816Z if compiled: 2025-05-07T20:32:47.2050136Z op = torch.compile(op) 2025-05-07T20:32:47.2050444Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2050719Z 2025-05-07T20:32:47.2050911Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.2051084Z 2025-05-07T20:32:47.2051187Z moe/activation_test.py:117: 2025-05-07T20:32:47.2051489Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2051819Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.2052104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2052808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.2053500Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.2054034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.2054721Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.2055392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.2055918Z kernel = self.compile( 2025-05-07T20:32:47.2056465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.2057120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.2063592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2063846Z 2025-05-07T20:32:47.2064065Z self = 2025-05-07T20:32:47.2065145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.2066678Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3257181990>} 2025-05-07T20:32:47.2068020Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.2069042Z context = 2025-05-07T20:32:47.2069329Z 2025-05-07T20:32:47.2069555Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.2070119Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.2070589Z module_map=module_map) 2025-05-07T20:32:47.2070962Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.2071321Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.2071586Z E ^ 2025-05-07T20:32:47.2072056Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.2072503Z 2025-05-07T20:32:47.2072927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.2073437Z 2025-05-07T20:32:47.2073554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.2073963Z self=, 2025-05-07T20:32:47.2074375Z T=2048, 2025-05-07T20:32:47.2074575Z D=5120, 2025-05-07T20:32:47.2074768Z scale_ub=1200.0, 2025-05-07T20:32:47.2075027Z contiguous=True, 2025-05-07T20:32:47.2075264Z compiled=True, 2025-05-07T20:32:47.2075480Z ) 2025-05-07T20:32:47.2075798Z self = 2025-05-07T20:32:47.2076294Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.2076618Z 2025-05-07T20:32:47.2076709Z @given( 2025-05-07T20:32:47.2076941Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.2077260Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.2077572Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.2077907Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.2078230Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.2078519Z ) 2025-05-07T20:32:47.2078881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.2079323Z def test_silu_mul_quant( 2025-05-07T20:32:47.2079579Z self, 2025-05-07T20:32:47.2079788Z T: int, 2025-05-07T20:32:47.2079992Z D: int, 2025-05-07T20:32:47.2080222Z scale_ub: Optional[float], 2025-05-07T20:32:47.2080504Z contiguous: bool, 2025-05-07T20:32:47.2080740Z compiled: bool, 2025-05-07T20:32:47.2080981Z ) -> None: 2025-05-07T20:32:47.2081206Z torch.manual_seed(2025) 2025-05-07T20:32:47.2081448Z 2025-05-07T20:32:47.2081726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.2082072Z 2025-05-07T20:32:47.2082272Z x_sign = torch.sign(x) 2025-05-07T20:32:47.2082562Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.2082877Z x = x_sign * x_clamp 2025-05-07T20:32:47.2083123Z x0 = x[:, :D] 2025-05-07T20:32:47.2083340Z x1 = x[:, D:] 2025-05-07T20:32:47.2083562Z 2025-05-07T20:32:47.2083755Z if contiguous: 2025-05-07T20:32:47.2083990Z x0 = x0.contiguous() 2025-05-07T20:32:47.2084265Z x1 = x1.contiguous() 2025-05-07T20:32:47.2084513Z 2025-05-07T20:32:47.2084727Z if scale_ub is not None: 2025-05-07T20:32:47.2085035Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.2085375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.2085742Z ) 2025-05-07T20:32:47.2085944Z else: 2025-05-07T20:32:47.2086163Z scale_ub_tensor = None 2025-05-07T20:32:47.2086415Z 2025-05-07T20:32:47.2086654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2086980Z op = silu_mul_quant 2025-05-07T20:32:47.2087233Z if compiled: 
2025-05-07T20:32:47.2087490Z op = torch.compile(op) 2025-05-07T20:32:47.2087792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2088071Z 2025-05-07T20:32:47.2088312Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.2088605Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.2088933Z 2025-05-07T20:32:47.2089179Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2089513Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.2089801Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.2090523Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.2090894Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.2091200Z 2025-05-07T20:32:47.2091409Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.2091603Z 2025-05-07T20:32:47.2091710Z moe/activation_test.py:126: 2025-05-07T20:32:47.2092006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2092343Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.2092679Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.2093472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.2094220Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.2094798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.2095593Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.2096288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.2097004Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.2097759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:47.2098511Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.2099237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.2099974Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.2100591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.2101112Z fn() 2025-05-07T20:32:47.2101632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.2102222Z self.fn.run( 2025-05-07T20:32:47.2102688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.2103214Z kernel = self.compile( 2025-05-07T20:32:47.2103757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.2104419Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.2104824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2105057Z 2025-05-07T20:32:47.2105266Z self = 2025-05-07T20:32:47.2106342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:47.2107814Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3255c1d3f0>} 2025-05-07T20:32:47.2109145Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.2110238Z context = 2025-05-07T20:32:47.2110522Z 2025-05-07T20:32:47.2110744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.2111268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.2111737Z module_map=module_map) 2025-05-07T20:32:47.2112107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.2112463Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.2112730Z E ^ 2025-05-07T20:32:47.2113201Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.2113646Z 2025-05-07T20:32:47.2114067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.2114614Z 2025-05-07T20:32:47.2114741Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.2115163Z self=, 2025-05-07T20:32:47.2115570Z T=16384, 2025-05-07T20:32:47.2115759Z D=7168, 2025-05-07T20:32:47.2115960Z scale_ub=1200.0, 2025-05-07T20:32:47.2116192Z contiguous=False, 2025-05-07T20:32:47.2116417Z compiled=False, 2025-05-07T20:32:47.2116626Z ) 2025-05-07T20:32:48.3986714Z self = 2025-05-07T20:32:48.3987517Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:48.3987870Z 2025-05-07T20:32:48.3987955Z @given( 2025-05-07T20:32:48.3988192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.3988509Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.3988810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.3989141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.3989470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.3989757Z ) 2025-05-07T20:32:48.3990390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.3990834Z def test_silu_mul_quant( 2025-05-07T20:32:48.3991069Z self, 2025-05-07T20:32:48.3991268Z T: int, 2025-05-07T20:32:48.3991472Z D: int, 2025-05-07T20:32:48.3991689Z scale_ub: Optional[float], 2025-05-07T20:32:48.3991969Z contiguous: bool, 2025-05-07T20:32:48.3992213Z compiled: bool, 2025-05-07T20:32:48.3992438Z ) -> None: 2025-05-07T20:32:48.3992657Z torch.manual_seed(2025) 2025-05-07T20:32:48.3992904Z 2025-05-07T20:32:48.3993184Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.3993520Z 2025-05-07T20:32:48.3993718Z x_sign = torch.sign(x) 2025-05-07T20:32:48.3994016Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.3994322Z x = x_sign * x_clamp 2025-05-07T20:32:48.3994573Z x0 = x[:, :D] 2025-05-07T20:32:48.3994796Z x1 = x[:, D:] 2025-05-07T20:32:48.3994999Z 2025-05-07T20:32:48.3995192Z if contiguous: 2025-05-07T20:32:48.3995428Z x0 = x0.contiguous() 2025-05-07T20:32:48.3995683Z x1 = x1.contiguous() 2025-05-07T20:32:48.3995923Z 2025-05-07T20:32:48.3996117Z if scale_ub is not None: 2025-05-07T20:32:48.3996518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.3996860Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.3997171Z ) 2025-05-07T20:32:48.3997357Z else: 2025-05-07T20:32:48.3997572Z scale_ub_tensor = None 2025-05-07T20:32:48.3997828Z 2025-05-07T20:32:48.3998055Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:48.3998365Z op = silu_mul_quant 2025-05-07T20:32:48.3998619Z if compiled: 2025-05-07T20:32:48.3998874Z op = torch.compile(op) 2025-05-07T20:32:48.3999261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.3999536Z 2025-05-07T20:32:48.3999806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.3999975Z 2025-05-07T20:32:48.4000075Z moe/activation_test.py:117: 2025-05-07T20:32:48.4000376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.4000708Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.4000992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.4001684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.4002375Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.4002910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.4003586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.4004249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.4004787Z kernel = self.compile( 2025-05-07T20:32:48.4005325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.4005981Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.4006440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.4006671Z 2025-05-07T20:32:48.4006891Z self = 2025-05-07T20:32:48.4007983Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.4009361Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3255c1ce50>} 2025-05-07T20:32:48.4010705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.4011715Z context = 2025-05-07T20:32:48.4012010Z 2025-05-07T20:32:48.4012180Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.4012706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.4013175Z module_map=module_map) 2025-05-07T20:32:48.4013536Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.4013887Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.4014152Z E ^ 2025-05-07T20:32:48.4014610Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.4015065Z 
2025-05-07T20:32:48.4015489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.4016004Z 
The next nine Hypothesis examples fail with the identical triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the failing call site varies with the compiled flag: with compiled=False, fn() fails at moe/activation_test.py:117 while compiling _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80); with compiled=True, fn() completes under torch.compile and the reference path ref_fn() fails at moe/activation_test.py:126 while compiling _kernel_quantize_fp8_row (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370, via triton_quantize_fp8_row):

2025-05-07T20:32:48.4016107Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in ref_fn()
2025-05-07T20:32:48.4056121Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn()
2025-05-07T20:32:49.9590796Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn()
2025-05-07T20:32:49.9622283Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in ref_fn()
2025-05-07T20:32:50.0289166Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn()
2025-05-07T20:32:50.3940592Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in fn()
2025-05-07T20:32:50.3971620Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in ref_fn()
2025-05-07T20:32:50.9738890Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in ref_fn()
2025-05-07T20:32:51.5098735Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in ref_fn()
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254f24280>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[report identical to the T=128 example above: ref_fn() raises the same CompilationError from _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)

W0507 20:32:53.178000 88454 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:32:53.178000 88454 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:32:53.178000 88454 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:32:53.178000 88454 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:32:53.178000 88454 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
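Note on the recompile_limit warning above: torch.compile specializes silu_mul_quant on input shapes and strides, so this sweep over (T, D, contiguous) compiles a new variant per example until torch._dynamo hits its limit of 8 and falls back. A minimal sketch of two ways to keep a parameter sweep like this under the limit; this is an illustration under that assumption, not code from the test suite:

    import torch

    def compile_fresh(op):
        # Drop all cached graphs so each example compiles from scratch
        # instead of counting against the per-code-object recompile limit.
        torch._dynamo.reset()
        return torch.compile(op)

    def compile_dynamic(op):
        # Or compile with dynamic shapes so one graph serves many
        # shape/stride combinations rather than specializing per example.
        return torch.compile(op, dynamic=True)

Either helper would stand in for the bare torch.compile(op) call inside fn() in the test above.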
[report for T=16384 identical to the T=128 example above: ref_fn() raises the same CompilationError from _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[fn() raises the same CompilationError, here from _fbgemm_silu_mul_quant via fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[ref_fn() raises the same CompilationError from _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[fn() raises the same CompilationError from _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[fn() raises the same CompilationError from _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[fn() raises the same CompilationError from _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[fn() raises the same CompilationError from _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[fn() fails in _fbgemm_silu_mul_quant with:]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.1716291Z 2025-05-07T20:32:54.1716706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.1717214Z 2025-05-07T20:32:54.1717329Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1717749Z self=, 2025-05-07T20:32:54.1718147Z T=1, 2025-05-07T20:32:54.1718341Z D=7168, 2025-05-07T20:32:54.1718540Z scale_ub=1200.0, 2025-05-07T20:32:54.1718758Z contiguous=True, 2025-05-07T20:32:54.1718984Z compiled=True, 2025-05-07T20:32:54.1719192Z ) 2025-05-07T20:32:54.1719510Z self = 2025-05-07T20:32:54.1720046Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.1720310Z 2025-05-07T20:32:54.1720394Z @given( 2025-05-07T20:32:54.1720627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1720944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1721257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1721591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1721920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1722214Z ) 2025-05-07T20:32:54.1722566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1723006Z def test_silu_mul_quant( 2025-05-07T20:32:54.1723257Z self, 2025-05-07T20:32:54.1723461Z T: int, 2025-05-07T20:32:54.1723660Z D: int, 2025-05-07T20:32:54.1723890Z scale_ub: Optional[float], 2025-05-07T20:32:54.1724170Z contiguous: bool, 2025-05-07T20:32:54.1724411Z compiled: bool, 2025-05-07T20:32:54.1730988Z ) -> None: 2025-05-07T20:32:54.1731234Z torch.manual_seed(2025) 2025-05-07T20:32:54.1731499Z 2025-05-07T20:32:54.1731791Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1732150Z 2025-05-07T20:32:54.1732360Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1732657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1732982Z x = x_sign * x_clamp 2025-05-07T20:32:54.1733249Z x0 = x[:, :D] 2025-05-07T20:32:54.1733481Z x1 = x[:, D:] 2025-05-07T20:32:54.1733697Z 2025-05-07T20:32:54.1733902Z if contiguous: 2025-05-07T20:32:54.1734150Z x0 = x0.contiguous() 2025-05-07T20:32:54.1734415Z x1 = x1.contiguous() 2025-05-07T20:32:54.1734669Z 2025-05-07T20:32:54.1734873Z if scale_ub is not None: 2025-05-07T20:32:54.1735155Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1735591Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1735912Z ) 2025-05-07T20:32:54.1736113Z else: 2025-05-07T20:32:54.1736342Z scale_ub_tensor = None 2025-05-07T20:32:54.1736604Z 2025-05-07T20:32:54.1736844Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1737172Z op = silu_mul_quant 2025-05-07T20:32:54.1737437Z if compiled: 2025-05-07T20:32:54.1737693Z op = torch.compile(op) 2025-05-07T20:32:54.1738056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1738340Z 2025-05-07T20:32:54.1738540Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.1738762Z 2025-05-07T20:32:54.1738870Z moe/activation_test.py:117: 2025-05-07T20:32:54.1739180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1739525Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.1739888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1740467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.1741045Z return fn(*args, **kwargs) 
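Every failure in this run has the same root cause: the Triton kernels request the fp8e4nv dtype (FP8 E4M3, torch.float8_e4m3fn), which Triton's NVIDIA backend only accepts on GPUs with compute capability 8.9 or newer. The A10G in a linux.g5.4xlarge runner reports compute capability 8.6, where only the fp8e4b15 and fp8e5 variants exist, so every kernel that touches E4M3 fails at compile time before the test logic runs. A minimal guard sketch (the helper name and decorator are hypothetical, not part of activation_test.py) that would skip such tests on unsupported hardware:

    import unittest

    import torch


    def gpu_supports_fp8_e4m3() -> bool:
        """True if Triton's fp8e4nv (torch.float8_e4m3fn) is usable on this GPU.

        NVIDIA support for E4M3 starts at compute capability (8, 9) (Ada/Hopper);
        the A10G on a g5.4xlarge reports (8, 6).
        """
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical usage: skip FP8 tests instead of failing at kernel compile time.
    fp8_test = unittest.skipUnless(
        gpu_supports_fp8_e4m3(), "FP8 E4M3 (fp8e4nv) not supported on this GPU"
    )

Gating on torch.cuda.get_device_capability() keeps the suite green on pre-Ada runners while still exercising the FP8 path on sm_89/sm_90-class machines.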
2025-05-07T20:32:54.1717329Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError in _fbgemm_silu_mul_quant; with compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 before reaching silu_mul_quant
2025-05-07T20:32:54.1756346Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
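The error is independent of FBGEMM and reproducible with a few lines of Triton. A standalone repro sketch, assuming Triton's tl.float8e4nv dtype and torch.float8_e4m3fn are available (any cast to E4M3 inside a kernel is enough to trigger the architecture check):

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_fp8_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # On GPUs older than sm_89 this cast fails at compile time with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    N = 1024
    x = torch.randn(N, device="cuda", dtype=torch.float32)
    y = torch.empty(N, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8_kernel[(triton.cdiv(N, 256),)](x, y, N, BLOCK=256)

On sm_89/sm_90 hardware the same kernel compiles and runs; on this runner the NVIDIA backend rejects the dtype during lowering, which is why it surfaces as a CompilationError pointing at the kernel's first line.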
2025-05-07T20:32:54.3170848Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.3171263Z     self=<...>,
2025-05-07T20:32:54.3171664Z     T=1,
2025-05-07T20:32:54.3171846Z     D=7168,
2025-05-07T20:32:54.3172042Z     scale_ub=None,
2025-05-07T20:32:54.3172260Z     contiguous=False,
2025-05-07T20:32:54.3172485Z     compiled=True,
2025-05-07T20:32:54.3172692Z )
2025-05-07T20:32:54.5778627Z self = <...>
2025-05-07T20:32:54.5779690Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:54.5780453Z     ... (test body as above; this time fn() returned successfully) ...
2025-05-07T20:32:54.5794530Z         y_fp8, y_scale = fn()
2025-05-07T20:32:54.5794817Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:54.5795367Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.5795704Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:54.5796091Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:54.5796417Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:54.5796782Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.5797298Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:54.5797605Z moe/activation_test.py:126:
2025-05-07T20:32:54.5797903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.5798246Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:54.5798585Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.5799374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:54.5800124Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:54.5800692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.5801381Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.5802065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:54.5802789Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.5803541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:54.5804296Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.5805021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:54.5805662Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:54.5806345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:54.5806868Z     fn()
2025-05-07T20:32:54.5807372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:54.5807963Z     self.fn.run(
2025-05-07T20:32:54.5808438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.5808975Z     kernel = self.compile(
2025-05-07T20:32:54.5809521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.5810315Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.5810727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.5816461Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.5817032Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.5817514Z                            module_map=module_map)
2025-05-07T20:32:54.5817884Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.5818248Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.5818524Z E   ^
2025-05-07T20:32:54.5818988Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.5819906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
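In the scale_ub=None, compiled=True example above, the op under test got past compilation and the failure moved into the reference path: triton_quantize_fp8_row (fp8_gemm.py:2370) launches _kernel_quantize_fp8_row through the autotuner and hits the same unsupported-dtype error. For debugging on hardware without E4M3 support, the row-wise quantization can be approximated in plain PyTorch; this is a sketch of the assumed semantics (per-row max-abs scaling into the float8_e4m3fn range, optionally capped by scale_ub), not FBGEMM's actual kernel:

    from typing import Optional, Tuple

    import torch


    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Row-wise FP8 quantization sketch: scale each row so its max |value|
        # lands at the float8_e4m3fn maximum (448.0). scale_ub, if given,
        # caps the per-row max before the scale is computed.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        scale = row_max.clamp(min=1e-12) / fp8_max
        y = (x.to(torch.float32) / scale[:, None]).clamp(-fp8_max, fp8_max)
        return y.to(torch.float8_e4m3fn), scale

A reference like this runs on any device PyTorch supports, so the numerics of silu_mul_quant can still be checked even where the Triton kernels cannot compile.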
2025-05-07T20:32:54.7526945Z op = torch.compile(op) 2025-05-07T20:32:54.7527255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.7527528Z 2025-05-07T20:32:54.7527724Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.7527892Z 2025-05-07T20:32:54.7528002Z moe/activation_test.py:117: 2025-05-07T20:32:54.7528413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.7528749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.7529034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.7529596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.7530147Z return fn(*args, **kwargs) 2025-05-07T20:32:54.7530804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.7531492Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.7532026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.7532703Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.7533370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.7533905Z kernel = self.compile( 2025-05-07T20:32:54.7534442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.7535098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.7535490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.7535715Z 2025-05-07T20:32:54.7535931Z self = 2025-05-07T20:32:54.7536996Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.7538369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254553eb0>} 2025-05-07T20:32:54.7539828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.7540860Z context = 2025-05-07T20:32:54.7541144Z 2025-05-07T20:32:54.7541311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.7541831Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.7542350Z module_map=module_map) 2025-05-07T20:32:54.7542759Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.7543111Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.7543374Z E ^ 2025-05-07T20:32:54.7543846Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.7544299Z 2025-05-07T20:32:54.7544716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.7545239Z 2025-05-07T20:32:54.7545347Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7545764Z self=, 2025-05-07T20:32:54.7546166Z T=1, 2025-05-07T20:32:54.7546352Z D=5120, 2025-05-07T20:32:54.7546554Z scale_ub=1200.0, 2025-05-07T20:32:54.7546781Z contiguous=False, 2025-05-07T20:32:54.7547011Z compiled=False, 2025-05-07T20:32:54.7547227Z ) 2025-05-07T20:32:54.7547549Z self = 2025-05-07T20:32:54.7548041Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.7548312Z 2025-05-07T20:32:54.7548390Z @given( 2025-05-07T20:32:54.7548623Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7548985Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7549293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7549628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7549957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7550236Z ) 2025-05-07T20:32:54.7550586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7551029Z def test_silu_mul_quant( 2025-05-07T20:32:54.7551266Z self, 2025-05-07T20:32:54.7551462Z T: int, 2025-05-07T20:32:54.7551670Z D: int, 2025-05-07T20:32:54.7551887Z scale_ub: Optional[float], 2025-05-07T20:32:54.7552163Z contiguous: bool, 2025-05-07T20:32:54.7552403Z compiled: bool, 2025-05-07T20:32:54.7552625Z ) -> None: 2025-05-07T20:32:54.7552841Z torch.manual_seed(2025) 2025-05-07T20:32:54.7553081Z 2025-05-07T20:32:54.7553352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7553702Z 2025-05-07T20:32:54.7553897Z x_sign = torch.sign(x) 2025-05-07T20:32:54.7554190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.7554497Z x = x_sign * x_clamp 2025-05-07T20:32:54.7554741Z x0 = x[:, :D] 2025-05-07T20:32:54.7554962Z x1 = x[:, D:] 2025-05-07T20:32:54.7555164Z 2025-05-07T20:32:54.7555358Z if contiguous: 2025-05-07T20:32:54.7555593Z x0 = x0.contiguous() 2025-05-07T20:32:54.7555851Z x1 = x1.contiguous() 2025-05-07T20:32:54.7556095Z 2025-05-07T20:32:54.7556289Z if scale_ub is not None: 2025-05-07T20:32:54.7556561Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.7556900Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.7557208Z ) 2025-05-07T20:32:54.7557398Z else: 2025-05-07T20:32:54.7557618Z scale_ub_tensor = None 2025-05-07T20:32:54.7557872Z 2025-05-07T20:32:54.7558157Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.7565039Z op = silu_mul_quant 2025-05-07T20:32:54.7565323Z if compiled: 2025-05-07T20:32:54.7565593Z op = torch.compile(op) 2025-05-07T20:32:54.7565897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.7566208Z 2025-05-07T20:32:54.7566417Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.7566588Z 2025-05-07T20:32:54.7566701Z moe/activation_test.py:117: 2025-05-07T20:32:54.7567003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.7567436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.7567774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.7568476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.7569169Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.7569723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.7570412Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.7571084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.7571628Z kernel = self.compile( 2025-05-07T20:32:54.7572181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.7572849Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.7573249Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.7573489Z 2025-05-07T20:32:54.7573698Z self = 2025-05-07T20:32:54.7574827Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.7576196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254cf0940>} 2025-05-07T20:32:54.7577550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.7578592Z context = 2025-05-07T20:32:54.7578889Z 2025-05-07T20:32:54.7579060Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.7579599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.7580165Z module_map=module_map) 2025-05-07T20:32:54.7580550Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.7580914Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.7581186Z E ^ 2025-05-07T20:32:54.7581658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.7582117Z 2025-05-07T20:32:54.7582540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.7583053Z 2025-05-07T20:32:54.7583170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7583592Z self=, 2025-05-07T20:32:54.7584007Z T=16384, 2025-05-07T20:32:54.7584216Z D=5120, 2025-05-07T20:32:54.7584424Z scale_ub=1200.0, 2025-05-07T20:32:54.7584652Z contiguous=False, 2025-05-07T20:32:54.7584884Z compiled=True, 2025-05-07T20:32:54.7585099Z ) 2025-05-07T20:32:54.8586118Z self = 2025-05-07T20:32:54.8587060Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.8587344Z 2025-05-07T20:32:54.8587422Z @given( 2025-05-07T20:32:54.8587665Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8587988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8588294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8588633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8589231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8589522Z ) 2025-05-07T20:32:54.8590243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8590697Z def test_silu_mul_quant( 2025-05-07T20:32:54.8590949Z self, 2025-05-07T20:32:54.8591144Z T: int, 2025-05-07T20:32:54.8591350Z D: int, 2025-05-07T20:32:54.8591577Z scale_ub: Optional[float], 2025-05-07T20:32:54.8591866Z contiguous: bool, 2025-05-07T20:32:54.8592121Z compiled: bool, 2025-05-07T20:32:54.8592357Z ) -> None: 2025-05-07T20:32:54.8592575Z torch.manual_seed(2025) 2025-05-07T20:32:54.8592827Z 2025-05-07T20:32:54.8593111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8593453Z 2025-05-07T20:32:54.8593654Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8593951Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8594263Z x = x_sign * x_clamp 2025-05-07T20:32:54.8594517Z x0 = x[:, :D] 2025-05-07T20:32:54.8594745Z x1 = x[:, D:] 2025-05-07T20:32:54.8594951Z 2025-05-07T20:32:54.8595156Z if contiguous: 2025-05-07T20:32:54.8595389Z x0 = x0.contiguous() 2025-05-07T20:32:54.8595651Z x1 = x1.contiguous() 2025-05-07T20:32:54.8595894Z 2025-05-07T20:32:54.8596086Z if scale_ub is not None: 2025-05-07T20:32:54.8596455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8596803Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8597117Z ) 2025-05-07T20:32:54.8597308Z else: 2025-05-07T20:32:54.8597529Z scale_ub_tensor = None 2025-05-07T20:32:54.8597780Z 2025-05-07T20:32:54.8598014Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8598329Z op = silu_mul_quant 2025-05-07T20:32:54.8598593Z if compiled: 2025-05-07T20:32:54.8598842Z op = torch.compile(op) 2025-05-07T20:32:54.8599145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8599424Z 2025-05-07T20:32:54.8599618Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8599790Z 2025-05-07T20:32:54.8599892Z moe/activation_test.py:117: 2025-05-07T20:32:54.8600192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8600524Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8600820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8601386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.8601951Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.8602614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8603301Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8603840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8604517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8605184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8605712Z kernel = self.compile( 2025-05-07T20:32:54.8606259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8606993Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8607393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8607620Z 2025-05-07T20:32:54.8607830Z self = 2025-05-07T20:32:54.8608904Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8610430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31337c88b0>} 2025-05-07T20:32:54.8611765Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8612788Z context = 2025-05-07T20:32:54.8613073Z 2025-05-07T20:32:54.8613248Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8613762Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8614230Z module_map=module_map) 2025-05-07T20:32:54.8614601Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8614956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8615214Z E ^ 2025-05-07T20:32:54.8615685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8616130Z 2025-05-07T20:32:54.8616594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8617111Z 2025-05-07T20:32:54.8617221Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8617630Z self=, 2025-05-07T20:32:54.8618034Z T=2048, 2025-05-07T20:32:54.8618228Z D=7168, 2025-05-07T20:32:54.8618419Z scale_ub=1200.0, 2025-05-07T20:32:54.8618651Z contiguous=False, 2025-05-07T20:32:54.8618880Z compiled=True, 2025-05-07T20:32:54.8619082Z ) 2025-05-07T20:32:54.8619404Z self = 2025-05-07T20:32:54.8619971Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.8620249Z 2025-05-07T20:32:54.8620329Z @given( 2025-05-07T20:32:54.8620565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8620885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8621196Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8621526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8621865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8622154Z ) 2025-05-07T20:32:54.8622503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8622947Z def test_silu_mul_quant( 2025-05-07T20:32:54.8623191Z self, 2025-05-07T20:32:54.8623382Z T: int, 2025-05-07T20:32:54.8623586Z D: int, 2025-05-07T20:32:54.8623813Z scale_ub: Optional[float], 2025-05-07T20:32:54.8624088Z contiguous: bool, 2025-05-07T20:32:54.8624331Z compiled: bool, 2025-05-07T20:32:54.8624559Z ) -> None: 2025-05-07T20:32:54.8624772Z torch.manual_seed(2025) 2025-05-07T20:32:54.8625020Z 2025-05-07T20:32:54.8625295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8625640Z 2025-05-07T20:32:54.8625831Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8626188Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8626498Z x = x_sign * x_clamp 2025-05-07T20:32:54.8626757Z x0 = x[:, :D] 2025-05-07T20:32:54.8627007Z x1 = x[:, D:] 2025-05-07T20:32:54.8627217Z 2025-05-07T20:32:54.8627402Z if contiguous: 2025-05-07T20:32:54.8627638Z x0 = x0.contiguous() 2025-05-07T20:32:54.8627905Z x1 = x1.contiguous() 2025-05-07T20:32:54.8628141Z 2025-05-07T20:32:54.8628340Z if scale_ub is not None: 2025-05-07T20:32:54.8628618Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8628997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8629347Z ) 2025-05-07T20:32:54.8629545Z else: 2025-05-07T20:32:54.8629753Z scale_ub_tensor = None 2025-05-07T20:32:54.8630013Z 2025-05-07T20:32:54.8630251Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8630563Z op = silu_mul_quant 2025-05-07T20:32:54.8630822Z if compiled: 2025-05-07T20:32:54.8631075Z op = torch.compile(op) 2025-05-07T20:32:54.8631376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8631646Z 2025-05-07T20:32:54.8631843Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8632012Z 2025-05-07T20:32:54.8632119Z moe/activation_test.py:117: 2025-05-07T20:32:54.8632415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8632749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8633044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8633602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.8634167Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.8634824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8635515Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8636094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8636780Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8637447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8637983Z kernel = self.compile( 2025-05-07T20:32:54.8638521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8639189Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8639589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8639816Z 2025-05-07T20:32:54.8640026Z self = 2025-05-07T20:32:54.8641100Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8642460Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31337c9090>} 2025-05-07T20:32:54.8643811Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8644855Z context = 2025-05-07T20:32:54.8645140Z 2025-05-07T20:32:54.8645314Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8645832Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8646349Z module_map=module_map) 2025-05-07T20:32:54.8646720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8647075Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8647341Z E ^ 2025-05-07T20:32:54.8647809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8648254Z 2025-05-07T20:32:54.8648675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8649239Z 2025-05-07T20:32:54.9940977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9941799Z self=, 2025-05-07T20:32:54.9942330Z T=1, 2025-05-07T20:32:54.9942525Z D=5120, 2025-05-07T20:32:54.9942727Z scale_ub=None, 2025-05-07T20:32:54.9942946Z contiguous=False, 2025-05-07T20:32:54.9943176Z compiled=False, 2025-05-07T20:32:54.9943397Z ) 2025-05-07T20:32:54.9943723Z self = 2025-05-07T20:32:54.9944216Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.9944483Z 2025-05-07T20:32:54.9944563Z @given( 2025-05-07T20:32:54.9944794Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9945105Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9945418Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9945760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9946096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9946387Z ) 2025-05-07T20:32:54.9946752Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9947190Z def test_silu_mul_quant( 2025-05-07T20:32:54.9947436Z self, 2025-05-07T20:32:54.9947639Z T: int, 2025-05-07T20:32:54.9947844Z D: int, 2025-05-07T20:32:54.9948156Z scale_ub: Optional[float], 2025-05-07T20:32:54.9948442Z contiguous: bool, 2025-05-07T20:32:54.9948695Z compiled: bool, 2025-05-07T20:32:54.9948918Z ) -> None: 2025-05-07T20:32:54.9949137Z torch.manual_seed(2025) 2025-05-07T20:32:54.9949383Z 2025-05-07T20:32:54.9949656Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9949999Z 2025-05-07T20:32:54.9950200Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9950496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9950813Z x = x_sign * x_clamp 2025-05-07T20:32:54.9951058Z x0 = x[:, :D] 2025-05-07T20:32:54.9951278Z x1 = x[:, D:] 2025-05-07T20:32:54.9951494Z 2025-05-07T20:32:54.9951685Z if contiguous: 2025-05-07T20:32:54.9951917Z x0 = x0.contiguous() 2025-05-07T20:32:54.9952184Z x1 = x1.contiguous() 2025-05-07T20:32:54.9952435Z 2025-05-07T20:32:54.9952643Z if scale_ub is not None: 2025-05-07T20:32:54.9952925Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9953273Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9953592Z ) 2025-05-07T20:32:54.9953784Z else: 2025-05-07T20:32:54.9954011Z scale_ub_tensor = None 2025-05-07T20:32:54.9954272Z 2025-05-07T20:32:54.9954507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9954839Z op = silu_mul_quant 2025-05-07T20:32:54.9955100Z if compiled: 2025-05-07T20:32:54.9955358Z op = torch.compile(op) 2025-05-07T20:32:54.9955670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9955960Z 2025-05-07T20:32:54.9956155Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9956328Z 2025-05-07T20:32:54.9956435Z moe/activation_test.py:117: 2025-05-07T20:32:54.9956751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9957220Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9957514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9958206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9958898Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9959431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9960109Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9960897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9961440Z kernel = self.compile( 2025-05-07T20:32:54.9961979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9962632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9963037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9963265Z 2025-05-07T20:32:54.9963473Z self = 2025-05-07T20:32:54.9964546Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9965923Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31337c97e0>} 2025-05-07T20:32:54.9967253Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9968318Z context = 2025-05-07T20:32:54.9968605Z 2025-05-07T20:32:54.9968770Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9969294Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9969764Z module_map=module_map) 2025-05-07T20:32:54.9970135Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9970489Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9970760Z E ^ 2025-05-07T20:32:54.9971235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9971679Z 2025-05-07T20:32:54.9972092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9972605Z 2025-05-07T20:32:54.9972713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9973133Z self=, 2025-05-07T20:32:54.9973531Z T=4096, 2025-05-07T20:32:54.9973719Z D=7168, 2025-05-07T20:32:54.9973921Z scale_ub=1200.0, 2025-05-07T20:32:54.9974154Z contiguous=False, 2025-05-07T20:32:54.9974378Z compiled=False, 2025-05-07T20:32:54.9974587Z ) 2025-05-07T20:32:54.9974914Z self = 2025-05-07T20:32:54.9975404Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.9975696Z 2025-05-07T20:32:54.9975773Z @given( 2025-05-07T20:32:54.9976012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9976323Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9976636Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9976968Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9977301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9977641Z ) 2025-05-07T20:32:54.9977996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9978440Z def test_silu_mul_quant( 2025-05-07T20:32:54.9978684Z self, 2025-05-07T20:32:54.9978887Z T: int, 2025-05-07T20:32:54.9979087Z D: int, 2025-05-07T20:32:54.9979310Z scale_ub: Optional[float], 2025-05-07T20:32:54.9979586Z contiguous: bool, 2025-05-07T20:32:54.9979948Z compiled: bool, 2025-05-07T20:32:54.9980182Z ) -> None: 2025-05-07T20:32:54.9980502Z torch.manual_seed(2025) 2025-05-07T20:32:54.9980759Z 2025-05-07T20:32:54.9981082Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9981446Z 2025-05-07T20:32:54.9981651Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9981948Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9982270Z x = x_sign * x_clamp 2025-05-07T20:32:54.9982530Z x0 = x[:, :D] 2025-05-07T20:32:54.9982762Z x1 = x[:, D:] 2025-05-07T20:32:54.9982976Z 2025-05-07T20:32:54.9983172Z if contiguous: 2025-05-07T20:32:54.9983418Z x0 = x0.contiguous() 2025-05-07T20:32:54.9983684Z x1 = x1.contiguous() 2025-05-07T20:32:54.9983932Z 2025-05-07T20:32:54.9984137Z if scale_ub is not None: 2025-05-07T20:32:54.9984416Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9984757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9985079Z ) 2025-05-07T20:32:54.9985280Z else: 2025-05-07T20:32:54.9985528Z scale_ub_tensor = None 2025-05-07T20:32:54.9985795Z 2025-05-07T20:32:54.9986038Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9986356Z op = silu_mul_quant 2025-05-07T20:32:54.9986623Z if compiled: 2025-05-07T20:32:54.9986910Z op = torch.compile(op) 2025-05-07T20:32:54.9987292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9987585Z 2025-05-07T20:32:54.9987799Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9987969Z 2025-05-07T20:32:54.9988076Z moe/activation_test.py:117: 2025-05-07T20:32:54.9988384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9988725Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9989018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9989711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.9990848Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9991405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9992078Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9992747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9993287Z kernel = self.compile( 2025-05-07T20:32:54.9993835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9994486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9994886Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9995114Z 2025-05-07T20:32:54.9995329Z self = 2025-05-07T20:32:54.9996403Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9997773Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31337ca200>} 2025-05-07T20:32:54.9999217Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.0000243Z context = 2025-05-07T20:32:55.0000534Z 2025-05-07T20:32:55.0000710Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.0001309Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.0001844Z module_map=module_map) 2025-05-07T20:32:55.0002221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.0002582Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.0002845Z E ^ 2025-05-07T20:32:55.0003320Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.0003777Z 2025-05-07T20:32:55.0004201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.0004715Z 2025-05-07T20:32:55.0004828Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.0005243Z self=, 2025-05-07T20:32:55.0005654Z T=16384, 2025-05-07T20:32:55.0005858Z D=7168, 2025-05-07T20:32:55.0006064Z scale_ub=None, 2025-05-07T20:32:55.0006290Z contiguous=True, 2025-05-07T20:32:55.0006520Z compiled=True, 2025-05-07T20:32:55.0006725Z ) 2025-05-07T20:32:55.1947980Z self = 2025-05-07T20:32:55.1948628Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.1949017Z 2025-05-07T20:32:55.1949132Z @given( 2025-05-07T20:32:55.1949715Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1950132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1950460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1950788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1951127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1951420Z ) 2025-05-07T20:32:55.1951772Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1952222Z def test_silu_mul_quant( 2025-05-07T20:32:55.1952480Z self, 2025-05-07T20:32:55.1952676Z T: int, 2025-05-07T20:32:55.1952883Z D: int, 2025-05-07T20:32:55.1953115Z scale_ub: Optional[float], 2025-05-07T20:32:55.1953386Z contiguous: bool, 2025-05-07T20:32:55.1953641Z compiled: bool, 2025-05-07T20:32:55.1953877Z ) -> None: 2025-05-07T20:32:55.1954103Z torch.manual_seed(2025) 2025-05-07T20:32:55.1954351Z 2025-05-07T20:32:55.1954639Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1954987Z 2025-05-07T20:32:55.1955180Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1955479Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1955798Z x = x_sign * x_clamp 2025-05-07T20:32:55.1956042Z x0 = x[:, :D] 2025-05-07T20:32:55.1956268Z x1 = x[:, D:] 2025-05-07T20:32:55.1956486Z 2025-05-07T20:32:55.1956677Z if contiguous: 2025-05-07T20:32:55.1956918Z x0 = x0.contiguous() 2025-05-07T20:32:55.1957187Z x1 = x1.contiguous() 2025-05-07T20:32:55.1957423Z 2025-05-07T20:32:55.1957625Z if scale_ub is not None: 2025-05-07T20:32:55.1957919Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.1958258Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.1958569Z ) 2025-05-07T20:32:55.1958774Z else: 2025-05-07T20:32:55.1959095Z scale_ub_tensor = None 2025-05-07T20:32:55.1959350Z 2025-05-07T20:32:55.1959594Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1959912Z op = silu_mul_quant 2025-05-07T20:32:55.1960161Z if compiled: 2025-05-07T20:32:55.1960418Z op = torch.compile(op) 2025-05-07T20:32:55.1960719Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1960989Z 2025-05-07T20:32:55.1961197Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.1961363Z 2025-05-07T20:32:55.1961471Z moe/activation_test.py:117: 2025-05-07T20:32:55.1961858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1962268Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.1962560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1963125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.1963689Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.1964356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.1965047Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.1965577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.1966257Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.1966923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.1967466Z kernel = self.compile( 2025-05-07T20:32:55.1968012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.1968669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.1969071Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1969346Z 2025-05-07T20:32:55.1969566Z self = 2025-05-07T20:32:55.1970634Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.1972015Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31337cb760>} 2025-05-07T20:32:55.1973372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.1974399Z context = 2025-05-07T20:32:55.1974684Z 2025-05-07T20:32:55.1974858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.1975383Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.1975857Z module_map=module_map) 2025-05-07T20:32:55.1976223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.1976580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.1976868Z E ^ 2025-05-07T20:32:55.1977365Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.1978238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

(The following Hypothesis examples repeat the identical test body and traceback verbatim and fail with the same CompilationError; only the sampled parameters differ, so the duplicated blocks are collapsed to their parameters.)

2025-05-07T20:32:55.1978858Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:55.5333197Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:55.5365033Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:55.6682941Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)

Each example ends in:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
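This repeated error is architectural rather than flaky: fp8e4nv is Triton's FP8 E4M3 format, which the NVIDIA backend only lowers on GPUs with compute capability 8.9 or newer (Ada/Hopper); the A10G backing a linux.g5.4xlarge runner is SM 8.6, which is why only fp8e4b15 and fp8e5 are reported as supported. A minimal sketch of a capability guard that would skip such tests instead of failing them (the helper and class names below are illustrative assumptions, not FBGEMM's actual test code):

    import unittest

    import torch

    def gpu_supports_fp8_e4m3() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on compute capability >= 8.9,
        # e.g. Ada (SM 8.9) or Hopper (SM 9.0). On an A10G (SM 8.6) this
        # returns False, so the test below would be skipped rather than raise
        # the CompilationError seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not gpu_supports_fp8_e4m3(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class GuardedActivationTest(unittest.TestCase):
        # Hypothetical wrapper: the test_silu_mul_quant body from the log
        # above would live here unchanged.
        ...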
(The identical failure repeats for the following examples; the duplicated blocks are again collapsed to their parameters.)

2025-05-07T20:32:55.6723687Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.6754590Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:55.9420672Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:55.9451737Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:56.0482049Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:56.4039100Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

Each example ends in:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
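For reference on what is being compiled: silu_mul_quant returns a quantized tensor plus its scale, and fp8e4nv corresponds to PyTorch's torch.float8_e4m3fn. A plausible eager-mode reference, inferred from the test body above (an assumption for illustration, not FBGEMM's actual implementation):

    from typing import Optional, Tuple

    import torch
    import torch.nn.functional as F

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: SiLU-gate x0, multiply by x1, then quantize each
        # row to FP8 E4M3 with a per-row scale, optionally clamping the row
        # max by scale_ub. Inferred from the test; not the Triton kernel.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale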
2025-05-07T20:32:56.4074555Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:56.4105428Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.5982231Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:56.6016567Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.6018382Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:56.7082144Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.7083636Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:56.7115078Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.0801920Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:57.0833160Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.0834736Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:57.2015129Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.2016597Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:57.2046825Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.3420397Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:57.3456759Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.3458299Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:57.3487416Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.5397819Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:57.5428725Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.5430227Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:57.6492962Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.6493411Z 2025-05-07T20:32:57.6493825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.6494334Z 2025-05-07T20:32:57.6494445Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.6494851Z self=, 2025-05-07T20:32:57.6495248Z T=2048, 2025-05-07T20:32:57.6495435Z D=7168, 2025-05-07T20:32:57.6495623Z scale_ub=None, 2025-05-07T20:32:57.6495926Z contiguous=True, 2025-05-07T20:32:57.6496166Z compiled=True, 2025-05-07T20:32:57.6496373Z ) 2025-05-07T20:32:57.6496694Z self = 2025-05-07T20:32:57.6497182Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:57.6497445Z 2025-05-07T20:32:57.6497523Z @given( 2025-05-07T20:32:57.6497750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.6498061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.6498366Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.6498765Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.6499150Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.6499437Z ) 2025-05-07T20:32:57.6499878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.6500317Z def test_silu_mul_quant( 2025-05-07T20:32:57.6500558Z self, 2025-05-07T20:32:57.6500759Z T: int, 2025-05-07T20:32:57.6500956Z D: int, 2025-05-07T20:32:57.6501178Z scale_ub: Optional[float], 2025-05-07T20:32:57.6501449Z contiguous: bool, 2025-05-07T20:32:57.6501687Z compiled: bool, 2025-05-07T20:32:57.6501913Z ) -> None: 2025-05-07T20:32:57.6502136Z torch.manual_seed(2025) 2025-05-07T20:32:57.6502373Z 2025-05-07T20:32:57.6502646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.6502988Z 2025-05-07T20:32:57.6503177Z x_sign = torch.sign(x) 2025-05-07T20:32:57.6503474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.6503785Z x = x_sign * x_clamp 2025-05-07T20:32:57.6504031Z x0 = x[:, :D] 2025-05-07T20:32:57.6504249Z x1 = x[:, D:] 2025-05-07T20:32:57.6504457Z 2025-05-07T20:32:57.6504638Z if contiguous: 2025-05-07T20:32:57.6504874Z x0 = x0.contiguous() 2025-05-07T20:32:57.6505137Z x1 = x1.contiguous() 2025-05-07T20:32:57.6505437Z 2025-05-07T20:32:57.6505645Z if scale_ub is not None: 2025-05-07T20:32:57.6505926Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.6511799Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.6512133Z ) 2025-05-07T20:32:57.6512345Z else: 2025-05-07T20:32:57.6512562Z scale_ub_tensor = None 2025-05-07T20:32:57.6512835Z 2025-05-07T20:32:57.6513083Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6513404Z op = silu_mul_quant 2025-05-07T20:32:57.6513686Z if compiled: 2025-05-07T20:32:57.6513955Z op = torch.compile(op) 2025-05-07T20:32:57.6514265Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6514551Z 2025-05-07T20:32:57.6514755Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.6514931Z 2025-05-07T20:32:57.6515036Z moe/activation_test.py:117: 2025-05-07T20:32:57.6515356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6515697Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.6515988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6516550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:57.6517118Z return fn(*args, **kwargs) 
2025-05-07T20:32:57.6517788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.6518483Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.6519031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.6519722Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.6520394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.6521112Z kernel = self.compile( 2025-05-07T20:32:57.6521759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.6522554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.6523010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6523287Z 2025-05-07T20:32:57.6523519Z self = 2025-05-07T20:32:57.6524897Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.6526320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132f9e560>} 2025-05-07T20:32:57.6527693Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.6528713Z context = 2025-05-07T20:32:57.6529007Z 2025-05-07T20:32:57.6529176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.6529709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.6530193Z module_map=module_map) 2025-05-07T20:32:57.6530566Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.6530932Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.6531199Z E ^ 2025-05-07T20:32:57.6531669Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.6532125Z 2025-05-07T20:32:57.6532589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.6533114Z 2025-05-07T20:32:57.7326668Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7327454Z self=, 2025-05-07T20:32:57.7328172Z T=16384, 2025-05-07T20:32:57.7328372Z D=5120, 2025-05-07T20:32:57.7328577Z scale_ub=None, 2025-05-07T20:32:57.7328809Z contiguous=False, 2025-05-07T20:32:57.7329044Z compiled=False, 2025-05-07T20:32:57.7329256Z ) 2025-05-07T20:32:57.7329590Z self = 2025-05-07T20:32:57.7330091Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:57.7330377Z 2025-05-07T20:32:57.7330455Z @given( 2025-05-07T20:32:57.7330699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7331025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7331335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7331673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7332006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7332296Z ) 2025-05-07T20:32:57.7332653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7333097Z def test_silu_mul_quant( 2025-05-07T20:32:57.7333342Z self, 2025-05-07T20:32:57.7333548Z T: int, 2025-05-07T20:32:57.7333752Z D: int, 2025-05-07T20:32:57.7333974Z scale_ub: Optional[float], 2025-05-07T20:32:57.7334254Z contiguous: bool, 2025-05-07T20:32:57.7334504Z compiled: bool, 2025-05-07T20:32:57.7334728Z ) -> None: 2025-05-07T20:32:57.7334962Z torch.manual_seed(2025) 2025-05-07T20:32:57.7335208Z 2025-05-07T20:32:57.7335490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7335947Z 2025-05-07T20:32:57.7336146Z x_sign = torch.sign(x) 2025-05-07T20:32:57.7336445Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.7338515Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
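
The repeated CompilationError above is Triton rejecting the fp8e4nv dtype (NVIDIA's FP8 E4M3 format) on this GPU: fp8e4nv requires compute capability 8.9 or newer (Ada/Hopper), while the A10G in a g5.4xlarge instance is sm_86, where only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability check that such a test could use to skip FP8 E4M3 cases on older parts; the helper name is illustrative, and skipping (rather than falling back to another dtype) is an assumption:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (E4M3) needs sm_89 or newer; the A10G on this runner is sm_86.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Usage inside a test body:
    if not supports_fp8e4nv():
        raise unittest.SkipTest("FP8 E4M3 is not supported on this GPU")
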
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:57.7340534Z 2025-05-07T20:32:57.7340656Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:57.7340876Z 2025-05-07T20:32:57.7340983Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7341412Z self=, 2025-05-07T20:32:57.7341814Z T=4096, 2025-05-07T20:32:57.7342008Z D=7168, 2025-05-07T20:32:57.7342206Z scale_ub=1200.0, 2025-05-07T20:32:57.7342433Z contiguous=True, 2025-05-07T20:32:57.7342667Z compiled=True, 2025-05-07T20:32:57.7342874Z ) 2025-05-07T20:32:57.7343189Z self = 2025-05-07T20:32:57.7343694Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:57.7343965Z 2025-05-07T20:32:57.7344050Z @given( 2025-05-07T20:32:57.7344280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7344597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7344908Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7345242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7345570Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7345860Z ) 2025-05-07T20:32:57.7346287Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7346729Z def test_silu_mul_quant( 2025-05-07T20:32:57.7346973Z self, 2025-05-07T20:32:57.7347170Z T: int, 2025-05-07T20:32:57.7347367Z D: int, 2025-05-07T20:32:57.7347591Z scale_ub: Optional[float], 2025-05-07T20:32:57.7347876Z contiguous: bool, 2025-05-07T20:32:57.7348115Z compiled: bool, 2025-05-07T20:32:57.7348338Z ) -> None: 2025-05-07T20:32:57.7348567Z torch.manual_seed(2025) 2025-05-07T20:32:57.7348838Z 2025-05-07T20:32:57.7349111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7349458Z 2025-05-07T20:32:57.7349655Z x_sign = torch.sign(x) 2025-05-07T20:32:57.7349951Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.7351943Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:57.7353801Z 2025-05-07T20:32:57.7353925Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:57.7354144Z 2025-05-07T20:32:57.7354253Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7354673Z self=, 2025-05-07T20:32:57.7355069Z T=16384, 2025-05-07T20:32:57.7355268Z D=7168, 2025-05-07T20:32:57.7355466Z scale_ub=None, 2025-05-07T20:32:57.7355684Z contiguous=False, 2025-05-07T20:32:57.7355958Z compiled=False, 2025-05-07T20:32:57.7356169Z ) 2025-05-07T20:32:57.7356487Z self = 2025-05-07T20:32:57.7356978Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:57.7357257Z 2025-05-07T20:32:57.7357337Z @given( 2025-05-07T20:32:57.7357565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7357875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7358188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7358576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7358902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7359230Z ) 2025-05-07T20:32:57.7359585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7360031Z def test_silu_mul_quant( 2025-05-07T20:32:57.7360272Z self, 2025-05-07T20:32:57.7360470Z T: int, 2025-05-07T20:32:57.7360676Z D: int, 2025-05-07T20:32:57.7360896Z scale_ub: Optional[float], 2025-05-07T20:32:57.7361174Z contiguous: bool, 2025-05-07T20:32:57.7361416Z compiled: bool, 2025-05-07T20:32:57.7361638Z ) -> None: 2025-05-07T20:32:57.7361862Z torch.manual_seed(2025) 2025-05-07T20:32:57.7362105Z 2025-05-07T20:32:57.7362374Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7364412Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:57.7366349Z 2025-05-07T20:32:57.7366474Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:57.7366690Z 2025-05-07T20:32:57.7366795Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7367209Z self=, 2025-05-07T20:32:57.7367617Z T=2048, 2025-05-07T20:32:57.7367836Z D=7168, 2025-05-07T20:32:57.7368043Z scale_ub=1200.0, 2025-05-07T20:32:57.7368272Z contiguous=True, 2025-05-07T20:32:57.7368497Z compiled=True, 2025-05-07T20:32:57.7368706Z ) 2025-05-07T20:32:57.7369020Z self = 2025-05-07T20:32:57.7369513Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:57.7369787Z 2025-05-07T20:32:57.7369866Z @given( 2025-05-07T20:32:57.7370101Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7370407Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7370726Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7371061Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7371389Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7371677Z ) 2025-05-07T20:32:57.7372030Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7372471Z def test_silu_mul_quant( 2025-05-07T20:32:57.7372719Z self, 2025-05-07T20:32:57.7372917Z T: int, 2025-05-07T20:32:57.7373116Z D: int, 2025-05-07T20:32:57.7373341Z scale_ub: Optional[float], 2025-05-07T20:32:57.7373613Z contiguous: bool, 2025-05-07T20:32:57.7373853Z compiled: bool, 2025-05-07T20:32:57.7374079Z ) -> None: 2025-05-07T20:32:57.7374298Z torch.manual_seed(2025) 2025-05-07T20:32:57.7374539Z 2025-05-07T20:32:57.7374811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7375197Z 2025-05-07T20:32:57.7375398Z x_sign = torch.sign(x) 2025-05-07T20:32:57.7375691Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.7377707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:57.7379602Z 2025-05-07T20:32:57.7379727Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:57.7380005Z 2025-05-07T20:32:57.7380116Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7380532Z self=, 2025-05-07T20:32:57.7380927Z T=2048, 2025-05-07T20:32:57.7381119Z D=7168, 2025-05-07T20:32:57.7381315Z scale_ub=None, 2025-05-07T20:32:57.7381528Z contiguous=True, 2025-05-07T20:32:57.7381761Z compiled=False, 2025-05-07T20:32:57.7381965Z ) 2025-05-07T20:32:58.0442526Z self = 2025-05-07T20:32:58.0443076Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.0443355Z 2025-05-07T20:32:58.0443441Z @given( 2025-05-07T20:32:58.0443668Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.0443988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.0444292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.0444621Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.0444947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.0445236Z ) 2025-05-07T20:32:58.0445694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.0446149Z def test_silu_mul_quant( 2025-05-07T20:32:58.0446408Z self, 2025-05-07T20:32:58.0446611Z T: int, 2025-05-07T20:32:58.0446811Z D: int, 2025-05-07T20:32:58.0447032Z scale_ub: Optional[float], 2025-05-07T20:32:58.0447302Z contiguous: bool, 2025-05-07T20:32:58.0447545Z compiled: bool, 2025-05-07T20:32:58.0447776Z ) -> None: 2025-05-07T20:32:58.0447993Z torch.manual_seed(2025) 2025-05-07T20:32:58.0448241Z 2025-05-07T20:32:58.0448518Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.0448860Z 2025-05-07T20:32:58.0449057Z > x_sign = torch.sign(x) 2025-05-07T20:32:58.0450987Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
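
The OutOfMemoryError sizes reported in these examples follow directly from the test's input shape: x is a [T, 2*D] bfloat16 tensor, so each full-sized buffer costs T * 2*D * 2 bytes, and x_sign, x_clamp, and the x_sign * x_clamp product each allocate another buffer of the same size. A quick check against the T=16384, D=7168 draw above, which matches the 448.00 MiB figure in the allocator message:

    T, D = 16384, 7168
    bytes_per_elem = 2  # torch.bfloat16
    size_mib = T * (2 * D) * bytes_per_elem / 2**20
    print(f"{size_mib:.2f} MiB")  # 448.00 MiB, as reported by the allocator
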
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.0452822Z 2025-05-07T20:32:58.0452946Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:58.0453160Z 2025-05-07T20:32:58.0453266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.0453680Z self=, 2025-05-07T20:32:58.0454090Z T=1, 2025-05-07T20:32:58.0454281Z D=7168, 2025-05-07T20:32:58.0454479Z scale_ub=1200.0, 2025-05-07T20:32:58.0454710Z contiguous=True, 2025-05-07T20:32:58.0454940Z compiled=False, 2025-05-07T20:32:58.0455146Z ) 2025-05-07T20:32:58.0455466Z self = 2025-05-07T20:32:58.0456026Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.0456290Z 2025-05-07T20:32:58.0456372Z @given( 2025-05-07T20:32:58.0456605Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.0456916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.0457220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.0457550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.0457910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.0458291Z ) 2025-05-07T20:32:58.0458720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.0459159Z def test_silu_mul_quant( 2025-05-07T20:32:58.0459405Z self, 2025-05-07T20:32:58.0459607Z T: int, 2025-05-07T20:32:58.0459886Z D: int, 2025-05-07T20:32:58.0460118Z scale_ub: Optional[float], 2025-05-07T20:32:58.0460401Z contiguous: bool, 2025-05-07T20:32:58.0460642Z compiled: bool, 2025-05-07T20:32:58.0460871Z ) -> None: 2025-05-07T20:32:58.0461092Z torch.manual_seed(2025) 2025-05-07T20:32:58.0461333Z 2025-05-07T20:32:58.0461612Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.0461955Z 2025-05-07T20:32:58.0462152Z x_sign = torch.sign(x) 2025-05-07T20:32:58.0462447Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.0462766Z x = x_sign * x_clamp 2025-05-07T20:32:58.0463013Z x0 = x[:, :D] 2025-05-07T20:32:58.0463238Z x1 = x[:, D:] 2025-05-07T20:32:58.0463451Z 2025-05-07T20:32:58.0463645Z if contiguous: 2025-05-07T20:32:58.0463878Z x0 = x0.contiguous() 2025-05-07T20:32:58.0464138Z x1 = x1.contiguous() 2025-05-07T20:32:58.0464379Z 2025-05-07T20:32:58.0464572Z if scale_ub is not None: 2025-05-07T20:32:58.0464844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.0465234Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.0465545Z ) 2025-05-07T20:32:58.0465742Z else: 2025-05-07T20:32:58.0465957Z scale_ub_tensor = None 2025-05-07T20:32:58.0466203Z 2025-05-07T20:32:58.0466437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.0466751Z op = silu_mul_quant 2025-05-07T20:32:58.0467001Z if compiled: 2025-05-07T20:32:58.0467257Z op = torch.compile(op) 2025-05-07T20:32:58.0467563Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.0467848Z 2025-05-07T20:32:58.0468052Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.0468225Z 2025-05-07T20:32:58.0468328Z moe/activation_test.py:117: 2025-05-07T20:32:58.0468624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.0468953Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.0469240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.0469931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.0470618Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.0471154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.0471833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.0472491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.0473020Z kernel = self.compile( 2025-05-07T20:32:58.0473561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.0474213Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.0474605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.0474884Z 2025-05-07T20:32:58.0475094Z self = 2025-05-07T20:32:58.0476164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.0477535Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132d5c4c0>} 2025-05-07T20:32:58.0478954Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.0479981Z context = 2025-05-07T20:32:58.0480274Z 2025-05-07T20:32:58.0480444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.0480965Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.0481438Z module_map=module_map) 2025-05-07T20:32:58.0481807Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.0482167Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.0482432Z E ^ 2025-05-07T20:32:58.0482896Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.0483350Z 2025-05-07T20:32:58.0483769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.0484278Z 2025-05-07T20:32:58.0484385Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.0484799Z self=, 2025-05-07T20:32:58.0485239Z T=128, 2025-05-07T20:32:58.0485430Z D=5120, 2025-05-07T20:32:58.0485625Z scale_ub=None, 2025-05-07T20:32:58.0485838Z contiguous=True, 2025-05-07T20:32:58.0486076Z compiled=False, 2025-05-07T20:32:58.0486282Z ) 2025-05-07T20:32:58.1262534Z self = 2025-05-07T20:32:58.1263109Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.1263406Z 2025-05-07T20:32:58.1263489Z @given( 2025-05-07T20:32:58.1263729Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.1264049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.1264358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.1264685Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.1265015Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.1265298Z ) 2025-05-07T20:32:58.1265645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.1268639Z def test_silu_mul_quant( 2025-05-07T20:32:58.1268878Z self, 2025-05-07T20:32:58.1269067Z T: int, 2025-05-07T20:32:58.1269263Z D: int, 2025-05-07T20:32:58.1269483Z scale_ub: Optional[float], 2025-05-07T20:32:58.1269744Z contiguous: bool, 2025-05-07T20:32:58.1269978Z compiled: bool, 2025-05-07T20:32:58.1270201Z ) -> None: 2025-05-07T20:32:58.1270411Z torch.manual_seed(2025) 2025-05-07T20:32:58.1270647Z 2025-05-07T20:32:58.1270917Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.1271247Z 2025-05-07T20:32:58.1271433Z x_sign = torch.sign(x) 2025-05-07T20:32:58.1271720Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.1272026Z x = x_sign * x_clamp 2025-05-07T20:32:58.1272256Z x0 = x[:, :D] 2025-05-07T20:32:58.1272469Z x1 = x[:, D:] 2025-05-07T20:32:58.1272672Z 2025-05-07T20:32:58.1272852Z if contiguous: 2025-05-07T20:32:58.1273100Z x0 = x0.contiguous() 2025-05-07T20:32:58.1273346Z x1 = x1.contiguous() 2025-05-07T20:32:58.1273578Z 2025-05-07T20:32:58.1273766Z if scale_ub is not None: 2025-05-07T20:32:58.1274025Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.1274349Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.1274647Z ) 2025-05-07T20:32:58.1274831Z else: 2025-05-07T20:32:58.1275037Z scale_ub_tensor = None 2025-05-07T20:32:58.1275366Z 2025-05-07T20:32:58.1275591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.1275960Z op = silu_mul_quant 2025-05-07T20:32:58.1276211Z if compiled: 2025-05-07T20:32:58.1276448Z op = torch.compile(op) 2025-05-07T20:32:58.1276735Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1277001Z 2025-05-07T20:32:58.1277190Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.1277355Z 2025-05-07T20:32:58.1277456Z moe/activation_test.py:117: 2025-05-07T20:32:58.1277772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.1278120Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.1278392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1279073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.1279762Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.1280290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.1280962Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.1281611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.1282135Z kernel = self.compile( 2025-05-07T20:32:58.1282725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.1283380Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.1283765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.1283990Z 2025-05-07T20:32:58.1284196Z self = 2025-05-07T20:32:58.1285258Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.1286623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132d5c940>} 2025-05-07T20:32:58.1287944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.1289058Z context = 2025-05-07T20:32:58.1289338Z 2025-05-07T20:32:58.1289500Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.1290191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.1290658Z module_map=module_map) 2025-05-07T20:32:58.1291027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.1296930Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.1297203Z E ^ 2025-05-07T20:32:58.1297678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.1298125Z 2025-05-07T20:32:58.1298560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.1299072Z 2025-05-07T20:32:58.1299177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.1299586Z self=, 2025-05-07T20:32:58.1300098Z T=128, 2025-05-07T20:32:58.1300294Z D=7168, 2025-05-07T20:32:58.1300489Z scale_ub=None, 2025-05-07T20:32:58.1300711Z contiguous=True, 2025-05-07T20:32:58.1300940Z compiled=False, 2025-05-07T20:32:58.1301264Z ) 2025-05-07T20:32:58.1301589Z self = 2025-05-07T20:32:58.1302145Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.1302417Z 2025-05-07T20:32:58.1302496Z @given( 2025-05-07T20:32:58.1302728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.1303046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.1303351Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.1303693Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.1304030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.1304317Z ) 2025-05-07T20:32:58.1304670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.1305115Z def test_silu_mul_quant( 2025-05-07T20:32:58.1305365Z self, 2025-05-07T20:32:58.1305565Z T: int, 2025-05-07T20:32:58.1305775Z D: int, 2025-05-07T20:32:58.1306000Z scale_ub: Optional[float], 2025-05-07T20:32:58.1306272Z contiguous: bool, 2025-05-07T20:32:58.1306515Z compiled: bool, 2025-05-07T20:32:58.1306744Z ) -> None: 2025-05-07T20:32:58.1306960Z torch.manual_seed(2025) 2025-05-07T20:32:58.1307227Z 2025-05-07T20:32:58.1307504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.1307894Z 2025-05-07T20:32:58.1308158Z x_sign = torch.sign(x) 2025-05-07T20:32:58.1308467Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.1308778Z x = x_sign * x_clamp 2025-05-07T20:32:58.1309014Z x0 = x[:, :D] 2025-05-07T20:32:58.1309236Z x1 = x[:, D:] 2025-05-07T20:32:58.1309445Z 2025-05-07T20:32:58.1309633Z if contiguous: 2025-05-07T20:32:58.1309868Z x0 = x0.contiguous() 2025-05-07T20:32:58.1310126Z x1 = x1.contiguous() 2025-05-07T20:32:58.1310366Z 2025-05-07T20:32:58.1310561Z if scale_ub is not None: 2025-05-07T20:32:58.1310839Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.1311174Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.1311482Z ) 2025-05-07T20:32:58.1311677Z else: 2025-05-07T20:32:58.1311895Z scale_ub_tensor = None 2025-05-07T20:32:58.1312145Z 2025-05-07T20:32:58.1312390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.1312708Z op = silu_mul_quant 2025-05-07T20:32:58.1313054Z if compiled: 2025-05-07T20:32:58.1313306Z op = torch.compile(op) 2025-05-07T20:32:58.1313601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1313868Z 2025-05-07T20:32:58.1314065Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.1314237Z 2025-05-07T20:32:58.1314340Z moe/activation_test.py:117: 2025-05-07T20:32:58.1314637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.1314967Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.1315250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1315942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.1316624Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.1317160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.1317846Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.1318515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.1319039Z kernel = self.compile( 2025-05-07T20:32:58.1319594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.1320245Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.1320693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.1320958Z 2025-05-07T20:32:58.1321172Z self = 2025-05-07T20:32:58.1322249Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.1323624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132d5d240>} 2025-05-07T20:32:58.1324952Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.1325963Z context = 2025-05-07T20:32:58.1326249Z 2025-05-07T20:32:58.1326423Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.1326940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.1327407Z module_map=module_map) 2025-05-07T20:32:58.1327823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.1328183Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.1328447Z E ^ 2025-05-07T20:32:58.1328918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.1329360Z 2025-05-07T20:32:58.1329777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.1330285Z 2025-05-07T20:32:58.1330388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.1330801Z self=, 2025-05-07T20:32:58.1331198Z T=2048, 2025-05-07T20:32:58.1331390Z D=7168, 2025-05-07T20:32:58.1331580Z scale_ub=1200.0, 2025-05-07T20:32:58.1331804Z contiguous=True, 2025-05-07T20:32:58.1332028Z compiled=False, 2025-05-07T20:32:58.1332230Z ) 2025-05-07T20:32:58.2285453Z self = 2025-05-07T20:32:58.2285995Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.2286404Z 2025-05-07T20:32:58.2286496Z @given( 2025-05-07T20:32:58.2286726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2287044Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2287362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2287728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2288405Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2288976Z ) 2025-05-07T20:32:58.2289675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2290824Z def test_silu_mul_quant( 2025-05-07T20:32:58.2291294Z self, 2025-05-07T20:32:58.2291675Z T: int, 2025-05-07T20:32:58.2292056Z D: int, 2025-05-07T20:32:58.2292488Z scale_ub: Optional[float], 2025-05-07T20:32:58.2293021Z contiguous: bool, 2025-05-07T20:32:58.2293487Z compiled: bool, 2025-05-07T20:32:58.2293936Z ) -> None: 2025-05-07T20:32:58.2294364Z torch.manual_seed(2025) 2025-05-07T20:32:58.2294820Z 2025-05-07T20:32:58.2295359Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2298709Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
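
Each "Trying example" block above is Hypothesis re-running the same test body with a fresh draw from the sampled_from strategies, which is why the source listing repeats. When debugging a single failing combination it can help to pin it with hypothesis.example so it always runs first; a small self-contained sketch of the pattern (the test below is a stand-in, not the FBGEMM test):

    from hypothesis import example, given, settings, strategies as st

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @example(T=128)  # pinned case runs before any random draws
    @settings(max_examples=10, deadline=None)
    def test_demo(T: int) -> None:
        assert T >= 1

    test_demo()  # Hypothesis-wrapped tests can be called directly
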
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.2300735Z 2025-05-07T20:32:58.2300856Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.2301065Z 2025-05-07T20:32:58.2301175Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2301583Z self=, 2025-05-07T20:32:58.2301975Z T=1, 2025-05-07T20:32:58.2302161Z D=5120, 2025-05-07T20:32:58.2302345Z scale_ub=1200.0, 2025-05-07T20:32:58.2302566Z contiguous=True, 2025-05-07T20:32:58.2302787Z compiled=False, 2025-05-07T20:32:58.2302989Z ) 2025-05-07T20:32:58.2303304Z self = 2025-05-07T20:32:58.2303789Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.2304046Z 2025-05-07T20:32:58.2304127Z @given( 2025-05-07T20:32:58.2304351Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2304660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2304963Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2305350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2305680Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2305958Z ) 2025-05-07T20:32:58.2306298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2306731Z def test_silu_mul_quant( 2025-05-07T20:32:58.2306975Z self, 2025-05-07T20:32:58.2307164Z T: int, 2025-05-07T20:32:58.2307356Z D: int, 2025-05-07T20:32:58.2307572Z scale_ub: Optional[float], 2025-05-07T20:32:58.2307840Z contiguous: bool, 2025-05-07T20:32:58.2308077Z compiled: bool, 2025-05-07T20:32:58.2308293Z ) -> None: 2025-05-07T20:32:58.2308514Z torch.manual_seed(2025) 2025-05-07T20:32:58.2308746Z 2025-05-07T20:32:58.2309014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2309346Z 2025-05-07T20:32:58.2309533Z x_sign = torch.sign(x) 2025-05-07T20:32:58.2309819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.2310126Z x = x_sign * x_clamp 2025-05-07T20:32:58.2310433Z x0 = x[:, :D] 2025-05-07T20:32:58.2310646Z x1 = x[:, D:] 2025-05-07T20:32:58.2310854Z 2025-05-07T20:32:58.2311030Z if contiguous: 2025-05-07T20:32:58.2311264Z x0 = x0.contiguous() 2025-05-07T20:32:58.2311522Z x1 = x1.contiguous() 2025-05-07T20:32:58.2311753Z 2025-05-07T20:32:58.2311945Z if scale_ub is not None: 2025-05-07T20:32:58.2312217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.2312551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.2312847Z ) 2025-05-07T20:32:58.2313035Z else: 2025-05-07T20:32:58.2313244Z scale_ub_tensor = None 2025-05-07T20:32:58.2313485Z 2025-05-07T20:32:58.2313714Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.2314021Z op = silu_mul_quant 2025-05-07T20:32:58.2314262Z if compiled: 2025-05-07T20:32:58.2314508Z op = torch.compile(op) 2025-05-07T20:32:58.2314805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2315067Z 2025-05-07T20:32:58.2315257Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.2315418Z 2025-05-07T20:32:58.2315523Z moe/activation_test.py:117: 2025-05-07T20:32:58.2315809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2316133Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.2316416Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2317196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.2317876Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.2318456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.2319126Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.2319784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.2320308Z kernel = self.compile( 2025-05-07T20:32:58.2320842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.2321494Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.2321876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2322106Z 2025-05-07T20:32:58.2322311Z self = 2025-05-07T20:32:58.2323393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.2324830Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132d5e200>} 2025-05-07T20:32:58.2326151Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.2327158Z context = 2025-05-07T20:32:58.2327445Z 2025-05-07T20:32:58.2327615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.2328129Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.2328588Z module_map=module_map) 2025-05-07T20:32:58.2328952Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.2329297Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.2329557Z E ^ 2025-05-07T20:32:58.2330020Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.2330521Z 2025-05-07T20:32:58.2330937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.2331446Z 2025-05-07T20:32:58.2331550Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2331958Z self=, 2025-05-07T20:32:58.2332351Z T=2048, 2025-05-07T20:32:58.2332531Z D=5120, 2025-05-07T20:32:58.2332727Z scale_ub=None, 2025-05-07T20:32:58.2332942Z contiguous=True, 2025-05-07T20:32:58.2333160Z compiled=False, 2025-05-07T20:32:58.2333364Z ) 2025-05-07T20:32:58.2333679Z self = 2025-05-07T20:32:58.2334166Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.2334431Z 2025-05-07T20:32:58.2334508Z @given( 2025-05-07T20:32:58.2334740Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2335051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2335347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2335667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2335995Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2336268Z ) 2025-05-07T20:32:58.2336614Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2337098Z def test_silu_mul_quant( 2025-05-07T20:32:58.2337330Z self, 2025-05-07T20:32:58.2337564Z T: int, 2025-05-07T20:32:58.2337764Z D: int, 2025-05-07T20:32:58.2337977Z scale_ub: Optional[float], 2025-05-07T20:32:58.2338242Z contiguous: bool, 2025-05-07T20:32:58.2338482Z compiled: bool, 2025-05-07T20:32:58.2338698Z ) -> None: 2025-05-07T20:32:58.2338914Z torch.manual_seed(2025) 2025-05-07T20:32:58.2339155Z 2025-05-07T20:32:58.2339420Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2339878Z 2025-05-07T20:32:58.2340077Z > x_sign = torch.sign(x) 2025-05-07T20:32:58.2342000Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.2343860Z 2025-05-07T20:32:58.2343984Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:58.2344192Z 2025-05-07T20:32:58.2344405Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2344822Z self=, 2025-05-07T20:32:58.2345224Z T=16384, 2025-05-07T20:32:58.2345409Z D=5120, 2025-05-07T20:32:58.2345608Z scale_ub=None, 2025-05-07T20:32:58.2345825Z contiguous=True, 2025-05-07T20:32:58.2346046Z compiled=False, 2025-05-07T20:32:58.2346244Z ) 2025-05-07T20:32:58.3292055Z self = 2025-05-07T20:32:58.3292565Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.3292857Z 2025-05-07T20:32:58.3292945Z @given( 2025-05-07T20:32:58.3293172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3293485Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3293791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3294120Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3294453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3294833Z ) 2025-05-07T20:32:58.3295169Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3295608Z def test_silu_mul_quant( 2025-05-07T20:32:58.3295844Z self, 2025-05-07T20:32:58.3296034Z T: int, 2025-05-07T20:32:58.3296224Z D: int, 2025-05-07T20:32:58.3296444Z scale_ub: Optional[float], 2025-05-07T20:32:58.3296712Z contiguous: bool, 2025-05-07T20:32:58.3296942Z compiled: bool, 2025-05-07T20:32:58.3297160Z ) -> None: 2025-05-07T20:32:58.3297370Z torch.manual_seed(2025) 2025-05-07T20:32:58.3297602Z 2025-05-07T20:32:58.3297870Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3299966Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
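
The allocator hint repeated in these messages, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only takes effect if it is in the environment before the process makes its first CUDA allocation; toggling it inside an already-failing test is too late. A minimal sketch of applying it from Python, assuming the test process (rather than the workflow's job-level environment) is the place to set it:

    import os

    # Must be set before the first CUDA allocation in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var as a safe convention
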
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.3301798Z 2025-05-07T20:32:58.3301920Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.3302132Z 2025-05-07T20:32:58.3302319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3302775Z self=, 2025-05-07T20:32:58.3303172Z T=4096, 2025-05-07T20:32:58.3303354Z D=5120, 2025-05-07T20:32:58.3303538Z scale_ub=None, 2025-05-07T20:32:58.3303749Z contiguous=True, 2025-05-07T20:32:58.3303966Z compiled=False, 2025-05-07T20:32:58.3304164Z ) 2025-05-07T20:32:58.3304477Z self = 2025-05-07T20:32:58.3304961Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.3305228Z 2025-05-07T20:32:58.3305301Z @given( 2025-05-07T20:32:58.3305526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3305829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3306131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3306449Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3306774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3307049Z ) 2025-05-07T20:32:58.3307392Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3307830Z def test_silu_mul_quant( 2025-05-07T20:32:58.3308112Z self, 2025-05-07T20:32:58.3308312Z T: int, 2025-05-07T20:32:58.3308514Z D: int, 2025-05-07T20:32:58.3308745Z scale_ub: Optional[float], 2025-05-07T20:32:58.3309107Z contiguous: bool, 2025-05-07T20:32:58.3309368Z compiled: bool, 2025-05-07T20:32:58.3309603Z ) -> None: 2025-05-07T20:32:58.3309832Z torch.manual_seed(2025) 2025-05-07T20:32:58.3310084Z 2025-05-07T20:32:58.3310373Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3312954Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.3315327Z 2025-05-07T20:32:58.3315455Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.3315693Z 2025-05-07T20:32:58.3315854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3316315Z self=, 2025-05-07T20:32:58.3316773Z T=2048, 2025-05-07T20:32:58.3316964Z D=5120, 2025-05-07T20:32:58.3317158Z scale_ub=None, 2025-05-07T20:32:58.3317381Z contiguous=False, 2025-05-07T20:32:58.3317621Z compiled=False, 2025-05-07T20:32:58.3317834Z ) 2025-05-07T20:32:58.3318182Z self = 2025-05-07T20:32:58.3318750Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.3319067Z 2025-05-07T20:32:58.3319144Z @given( 2025-05-07T20:32:58.3319379Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3319720Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3320056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3320423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3320798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3321114Z ) 2025-05-07T20:32:58.3321504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3322025Z def test_silu_mul_quant( 2025-05-07T20:32:58.3322284Z self, 2025-05-07T20:32:58.3322482Z T: int, 2025-05-07T20:32:58.3322686Z D: int, 2025-05-07T20:32:58.3322918Z scale_ub: Optional[float], 2025-05-07T20:32:58.3323261Z contiguous: bool, 2025-05-07T20:32:58.3323517Z compiled: bool, 2025-05-07T20:32:58.3323752Z ) -> None: 2025-05-07T20:32:58.3324014Z torch.manual_seed(2025) 2025-05-07T20:32:58.3324272Z 2025-05-07T20:32:58.3324567Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3327153Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.3329576Z 2025-05-07T20:32:58.3329705Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.3329947Z 2025-05-07T20:32:58.3330055Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3330526Z self=, 2025-05-07T20:32:58.3330992Z T=4096, 2025-05-07T20:32:58.3331186Z D=7168, 2025-05-07T20:32:58.3331379Z scale_ub=None, 2025-05-07T20:32:58.3331604Z contiguous=True, 2025-05-07T20:32:58.3331839Z compiled=True, 2025-05-07T20:32:58.3332087Z ) 2025-05-07T20:32:58.3332444Z self = 2025-05-07T20:32:58.3333011Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.3333320Z 2025-05-07T20:32:58.3333398Z @given( 2025-05-07T20:32:58.3333637Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3333980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3334312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3334679Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3335041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3335358Z ) 2025-05-07T20:32:58.3335748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3336256Z def test_silu_mul_quant( 2025-05-07T20:32:58.3336516Z self, 2025-05-07T20:32:58.3336713Z T: int, 2025-05-07T20:32:58.3336916Z D: int, 2025-05-07T20:32:58.3337148Z scale_ub: Optional[float], 2025-05-07T20:32:58.3337489Z contiguous: bool, 2025-05-07T20:32:58.3337744Z compiled: bool, 2025-05-07T20:32:58.3337975Z ) -> None: 2025-05-07T20:32:58.3338196Z torch.manual_seed(2025) 2025-05-07T20:32:58.3338452Z 2025-05-07T20:32:58.3338744Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3341395Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.3343778Z 2025-05-07T20:32:58.3343906Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.3344148Z 2025-05-07T20:32:58.3344258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3344723Z self=, 2025-05-07T20:32:58.3345178Z T=2048, 2025-05-07T20:32:58.3345366Z D=5120, 2025-05-07T20:32:58.3345564Z scale_ub=1200.0, 2025-05-07T20:32:58.3345803Z contiguous=False, 2025-05-07T20:32:58.3346038Z compiled=False, 2025-05-07T20:32:58.3346308Z ) 2025-05-07T20:32:58.3346659Z self = 2025-05-07T20:32:58.3347290Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:58.3347569Z 2025-05-07T20:32:58.3347644Z @given( 2025-05-07T20:32:58.3347868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3348177Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3348475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3348804Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3349126Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3349396Z ) 2025-05-07T20:32:58.3349745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3350181Z def test_silu_mul_quant( 2025-05-07T20:32:58.3350411Z self, 2025-05-07T20:32:58.3350603Z T: int, 2025-05-07T20:32:58.3350803Z D: int, 2025-05-07T20:32:58.3356178Z scale_ub: Optional[float], 2025-05-07T20:32:58.3356471Z contiguous: bool, 2025-05-07T20:32:58.3356721Z compiled: bool, 2025-05-07T20:32:58.3356956Z ) -> None: 2025-05-07T20:32:58.3357173Z torch.manual_seed(2025) 2025-05-07T20:32:58.3357427Z 2025-05-07T20:32:58.3357732Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3359877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
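
Across this run the failures shift to smaller and smaller requests (448.00 MiB down to 40.00 MiB) while the "allocated by PyTorch" figure stays near 21.7 GiB, which is consistent with buffers from earlier Hypothesis examples never being released within the single test process. A sketch of a per-example cleanup, assuming that dropping Python references plus emptying the cache is sufficient here; the function name is illustrative:

    import gc

    import torch

    def free_cuda_between_examples() -> None:
        # Drop dangling references, then return cached blocks to the driver
        # so the next example starts from a mostly empty pool.
        gc.collect()
        torch.cuda.empty_cache()
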
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.3361762Z 2025-05-07T20:32:58.3361889Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.3362109Z 2025-05-07T20:32:58.3362219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3362641Z self=, 2025-05-07T20:32:58.3363047Z T=4096, 2025-05-07T20:32:58.3363236Z D=7168, 2025-05-07T20:32:58.3363434Z scale_ub=1200.0, 2025-05-07T20:32:58.3363670Z contiguous=True, 2025-05-07T20:32:58.3363896Z compiled=False, 2025-05-07T20:32:58.3364164Z ) 2025-05-07T20:32:58.4621271Z self = 2025-05-07T20:32:58.4621822Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.4622102Z 2025-05-07T20:32:58.4622184Z @given( 2025-05-07T20:32:58.4622425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.4622735Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.4623050Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.4623386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.4623721Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.4624002Z ) 2025-05-07T20:32:58.4624352Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.4624800Z def test_silu_mul_quant( 2025-05-07T20:32:58.4625043Z self, 2025-05-07T20:32:58.4625248Z T: int, 2025-05-07T20:32:58.4625453Z D: int, 2025-05-07T20:32:58.4625677Z scale_ub: Optional[float], 2025-05-07T20:32:58.4625946Z contiguous: bool, 2025-05-07T20:32:58.4626190Z compiled: bool, 2025-05-07T20:32:58.4626418Z ) -> None: 2025-05-07T20:32:58.4626632Z torch.manual_seed(2025) 2025-05-07T20:32:58.4626878Z 2025-05-07T20:32:58.4627151Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.4629276Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.4631192Z 2025-05-07T20:32:58.4631312Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.4631530Z 2025-05-07T20:32:58.4631634Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.4632050Z self=, 2025-05-07T20:32:58.4632455Z T=16384, 2025-05-07T20:32:58.4632648Z D=7168, 2025-05-07T20:32:58.4632846Z scale_ub=None, 2025-05-07T20:32:58.4633070Z contiguous=False, 2025-05-07T20:32:58.4633299Z compiled=True, 2025-05-07T20:32:58.4633509Z ) 2025-05-07T20:32:58.4633832Z self = 2025-05-07T20:32:58.4634320Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.4634598Z 2025-05-07T20:32:58.4634674Z @given( 2025-05-07T20:32:58.4634906Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.4635271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.4635579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.4635905Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.4636244Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.4636514Z ) 2025-05-07T20:32:58.4636858Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.4637293Z def test_silu_mul_quant( 2025-05-07T20:32:58.4637530Z self, 2025-05-07T20:32:58.4637724Z T: int, 2025-05-07T20:32:58.4637925Z D: int, 2025-05-07T20:32:58.4638140Z scale_ub: Optional[float], 2025-05-07T20:32:58.4638414Z contiguous: bool, 2025-05-07T20:32:58.4638660Z compiled: bool, 2025-05-07T20:32:58.4638876Z ) -> None: 2025-05-07T20:32:58.4639091Z torch.manual_seed(2025) 2025-05-07T20:32:58.4639329Z 2025-05-07T20:32:58.4639592Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.4641614Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
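[Editor's note] Note the pattern across these examples: only ~26 MiB of the 22.07 GiB device is free while 21.73 GiB is already held by PyTorch, so tensors from earlier Hypothesis examples appear to survive into later ones. A minimal cleanup sketch one could run between examples (editor's suggestion, not part of the test):

```python
import gc
import torch

def free_cuda_between_examples() -> None:
    # Drop dangling Python references first; empty_cache() can only return
    # blocks whose tensors have already been freed back to the driver.
    gc.collect()
    torch.cuda.empty_cache()
```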
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.4643529Z 2025-05-07T20:32:58.4643649Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.4643860Z 2025-05-07T20:32:58.4643965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.4644371Z self=, 2025-05-07T20:32:58.4644759Z T=4096, 2025-05-07T20:32:58.4644946Z D=7168, 2025-05-07T20:32:58.4645142Z scale_ub=None, 2025-05-07T20:32:58.4645352Z contiguous=True, 2025-05-07T20:32:58.4645573Z compiled=False, 2025-05-07T20:32:58.4645773Z ) 2025-05-07T20:32:58.4646081Z self = 2025-05-07T20:32:58.4646567Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.4646832Z 2025-05-07T20:32:58.4646905Z @given( 2025-05-07T20:32:58.4647129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.4647478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.4647781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.4648170Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.4648522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.4648804Z ) 2025-05-07T20:32:58.4649156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.4649595Z def test_silu_mul_quant( 2025-05-07T20:32:58.4649825Z self, 2025-05-07T20:32:58.4650021Z T: int, 2025-05-07T20:32:58.4650217Z D: int, 2025-05-07T20:32:58.4650433Z scale_ub: Optional[float], 2025-05-07T20:32:58.4650699Z contiguous: bool, 2025-05-07T20:32:58.4650934Z compiled: bool, 2025-05-07T20:32:58.4651148Z ) -> None: 2025-05-07T20:32:58.4651362Z torch.manual_seed(2025) 2025-05-07T20:32:58.4651592Z 2025-05-07T20:32:58.4651856Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.4653921Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.4655764Z 2025-05-07T20:32:58.4655880Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.4656093Z 2025-05-07T20:32:58.4656197Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.4656602Z self=, 2025-05-07T20:32:58.4656992Z T=16384, 2025-05-07T20:32:58.4657182Z D=7168, 2025-05-07T20:32:58.4657372Z scale_ub=None, 2025-05-07T20:32:58.4657581Z contiguous=True, 2025-05-07T20:32:58.4657801Z compiled=False, 2025-05-07T20:32:58.4658003Z ) 2025-05-07T20:32:58.4658308Z self = 2025-05-07T20:32:58.4658813Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.4659093Z 2025-05-07T20:32:58.4659167Z @given( 2025-05-07T20:32:58.4659398Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.4659835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.4660140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.4660462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.4660788Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.4661061Z ) 2025-05-07T20:32:58.4661406Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.4661846Z def test_silu_mul_quant( 2025-05-07T20:32:58.4662077Z self, 2025-05-07T20:32:58.4662270Z T: int, 2025-05-07T20:32:58.4662467Z D: int, 2025-05-07T20:32:58.4662683Z scale_ub: Optional[float], 2025-05-07T20:32:58.4662949Z contiguous: bool, 2025-05-07T20:32:58.4663185Z compiled: bool, 2025-05-07T20:32:58.4663402Z ) -> None: 2025-05-07T20:32:58.4663614Z torch.manual_seed(2025) 2025-05-07T20:32:58.4663855Z 2025-05-07T20:32:58.4664120Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.4666192Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
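[Editor's note] The 448 MiB requests are the largest case in the grid (T=16384, D=7168). Rather than letting such examples OOM, a guard could check free device memory first; `torch.cuda.mem_get_info()` returns `(free, total)` in bytes. A sketch:

```python
import torch

def has_free_cuda_mem(required_bytes: int) -> bool:
    # The runs above show ~26 MiB free against a 448 MiB request; a guard
    # like this would let the test skip such examples instead of failing.
    free, _total = torch.cuda.mem_get_info()
    return free >= required_bytes
```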
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.4668063Z 2025-05-07T20:32:58.4668180Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.4668395Z 2025-05-07T20:32:58.4668498Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.4668900Z self=, 2025-05-07T20:32:58.4669289Z T=16384, 2025-05-07T20:32:58.4669482Z D=7168, 2025-05-07T20:32:58.4669669Z scale_ub=1200.0, 2025-05-07T20:32:58.4669883Z contiguous=True, 2025-05-07T20:32:58.4670102Z compiled=False, 2025-05-07T20:32:58.4670304Z ) 2025-05-07T20:32:58.4670610Z self = 2025-05-07T20:32:58.4671096Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.4671378Z 2025-05-07T20:32:58.4671457Z @given( 2025-05-07T20:32:58.4671684Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.4671988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.4672296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.4672619Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.4672939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.4673216Z ) 2025-05-07T20:32:58.4673604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.4674041Z def test_silu_mul_quant( 2025-05-07T20:32:58.4674275Z self, 2025-05-07T20:32:58.4674465Z T: int, 2025-05-07T20:32:58.4674657Z D: int, 2025-05-07T20:32:58.4674869Z scale_ub: Optional[float], 2025-05-07T20:32:58.4675134Z contiguous: bool, 2025-05-07T20:32:58.4675371Z compiled: bool, 2025-05-07T20:32:58.4675585Z ) -> None: 2025-05-07T20:32:58.4675797Z torch.manual_seed(2025) 2025-05-07T20:32:58.4676036Z 2025-05-07T20:32:58.4676300Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.4678319Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
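[Editor's note] For reference, the `@given` block that keeps reappearing above samples from fixed lists, so the whole search space is a small grid; this is why `verbosity=Verbosity.verbose` prints one `Trying example:` block per drawn combination. Counting the grid:

```python
from itertools import product

Ts = [1, 128, 2048, 4096, 16384]
Ds = [5120, 7168]
scale_ubs = [None, 1200.00]
bools = [True, False]

cases = list(product(Ts, Ds, scale_ubs, bools, bools))
print(len(cases))  # 80 distinct (T, D, scale_ub, contiguous, compiled) tuples
```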
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.4680240Z 2025-05-07T20:32:58.4680359Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.4680565Z 2025-05-07T20:32:58.4680677Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.4681082Z self=, 2025-05-07T20:32:58.4681479Z T=128, 2025-05-07T20:32:58.4681662Z D=5120, 2025-05-07T20:32:58.4681855Z scale_ub=1200.0, 2025-05-07T20:32:58.4682076Z contiguous=False, 2025-05-07T20:32:58.4682296Z compiled=False, 2025-05-07T20:32:58.4682494Z ) 2025-05-07T20:32:58.6093583Z self = 2025-05-07T20:32:58.6094637Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:58.6095196Z 2025-05-07T20:32:58.6095349Z @given( 2025-05-07T20:32:58.6095812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.6096428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.6097036Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.6097687Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.6098296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.6098622Z ) 2025-05-07T20:32:58.6098977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.6099557Z def test_silu_mul_quant( 2025-05-07T20:32:58.6099983Z self, 2025-05-07T20:32:58.6100185Z T: int, 2025-05-07T20:32:58.6100388Z D: int, 2025-05-07T20:32:58.6100612Z scale_ub: Optional[float], 2025-05-07T20:32:58.6100888Z contiguous: bool, 2025-05-07T20:32:58.6101130Z compiled: bool, 2025-05-07T20:32:58.6101354Z ) -> None: 2025-05-07T20:32:58.6101571Z torch.manual_seed(2025) 2025-05-07T20:32:58.6101813Z 2025-05-07T20:32:58.6102089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.6102421Z 2025-05-07T20:32:58.6102617Z x_sign = torch.sign(x) 2025-05-07T20:32:58.6102911Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.6103214Z x = x_sign * x_clamp 2025-05-07T20:32:58.6103460Z x0 = x[:, :D] 2025-05-07T20:32:58.6103678Z x1 = x[:, D:] 2025-05-07T20:32:58.6103887Z 2025-05-07T20:32:58.6104076Z if contiguous: 2025-05-07T20:32:58.6104313Z x0 = x0.contiguous() 2025-05-07T20:32:58.6104568Z x1 = x1.contiguous() 2025-05-07T20:32:58.6104809Z 2025-05-07T20:32:58.6105008Z if scale_ub is not None: 2025-05-07T20:32:58.6105275Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.6105611Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.6105987Z ) 2025-05-07T20:32:58.6106182Z else: 2025-05-07T20:32:58.6106394Z scale_ub_tensor = None 2025-05-07T20:32:58.6106652Z 2025-05-07T20:32:58.6106885Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.6107193Z op = silu_mul_quant 2025-05-07T20:32:58.6107445Z if compiled: 2025-05-07T20:32:58.6107696Z op = torch.compile(op) 2025-05-07T20:32:58.6107990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.6108267Z 2025-05-07T20:32:58.6108466Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.6108631Z 2025-05-07T20:32:58.6108736Z moe/activation_test.py:117: 2025-05-07T20:32:58.6109038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.6109375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.6109657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.6110354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.6111121Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.6111657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.6112334Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.6112996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.6113529Z kernel = self.compile( 2025-05-07T20:32:58.6114073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.6114721Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.6115126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.6115353Z 2025-05-07T20:32:58.6115573Z self = 2025-05-07T20:32:58.6116646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.6118016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3133089ea0>} 2025-05-07T20:32:58.6119427Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.6120441Z context = 2025-05-07T20:32:58.6120728Z 2025-05-07T20:32:58.6120896Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.6121411Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.6121881Z module_map=module_map) 2025-05-07T20:32:58.6122249Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.6122600Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.6122855Z E ^ 2025-05-07T20:32:58.6123319Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.6123766Z 2025-05-07T20:32:58.6124186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.6124701Z 2025-05-07T20:32:58.6124808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.6125221Z self=, 2025-05-07T20:32:58.6125616Z T=2048, 2025-05-07T20:32:58.6125806Z D=7168, 2025-05-07T20:32:58.6125997Z scale_ub=None, 2025-05-07T20:32:58.6126257Z contiguous=False, 2025-05-07T20:32:58.6126495Z compiled=False, 2025-05-07T20:32:58.6126699Z ) 2025-05-07T20:32:58.6127021Z self = 2025-05-07T20:32:58.6127513Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.6127782Z 2025-05-07T20:32:58.6127857Z @given( 2025-05-07T20:32:58.6128089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.6128403Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.6128709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.6129045Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.6129374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.6129660Z ) 2025-05-07T20:32:58.6130004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.6130441Z def test_silu_mul_quant( 2025-05-07T20:32:58.6130681Z self, 2025-05-07T20:32:58.6130926Z T: int, 2025-05-07T20:32:58.6131126Z D: int, 2025-05-07T20:32:58.6131351Z scale_ub: Optional[float], 2025-05-07T20:32:58.6131616Z contiguous: bool, 2025-05-07T20:32:58.6131858Z compiled: bool, 2025-05-07T20:32:58.6132083Z ) -> None: 2025-05-07T20:32:58.6132296Z torch.manual_seed(2025) 2025-05-07T20:32:58.6132539Z 2025-05-07T20:32:58.6132807Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.6134841Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
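[Editor's note] The `CompilationError` just above is a different failure class from the OOMs: Triton refuses to emit `fp8e4nv` (e4m3) code on this GPU. The runner is a g5.4xlarge, i.e. an A10G at compute capability sm_86, while Triton's fp8e4nv path requires sm_89 (Ada) or sm_90+ (Hopper). A hedged sketch of a skip guard (names are the editor's, not FBGEMM's):

```python
import unittest
import torch

def _supports_fp8_e4m3() -> bool:
    # fp8e4nv is Triton's name for e4m3; codegen for it needs sm_89+.
    # The A10G backing this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

skip_unless_fp8 = unittest.skipUnless(_supports_fp8_e4m3(), "requires sm_89+ for fp8e4nv")
```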
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.6136679Z 2025-05-07T20:32:58.6136803Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.6137012Z 2025-05-07T20:32:58.6137117Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.6137528Z self=, 2025-05-07T20:32:58.6137944Z T=128, 2025-05-07T20:32:58.6138158Z D=7168, 2025-05-07T20:32:58.6138357Z scale_ub=1200.0, 2025-05-07T20:32:58.6138631Z contiguous=True, 2025-05-07T20:32:58.6138849Z compiled=True, 2025-05-07T20:32:58.6139052Z ) 2025-05-07T20:32:58.6557713Z self = 2025-05-07T20:32:58.6558286Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:58.6558555Z 2025-05-07T20:32:58.6558633Z @given( 2025-05-07T20:32:58.6558866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.6559169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.6559479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.6559807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.6560133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.6560418Z ) 2025-05-07T20:32:58.6560766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.6561208Z def test_silu_mul_quant( 2025-05-07T20:32:58.6561456Z self, 2025-05-07T20:32:58.6561652Z T: int, 2025-05-07T20:32:58.6561844Z D: int, 2025-05-07T20:32:58.6562070Z scale_ub: Optional[float], 2025-05-07T20:32:58.6562349Z contiguous: bool, 2025-05-07T20:32:58.6562585Z compiled: bool, 2025-05-07T20:32:58.6562812Z ) -> None: 2025-05-07T20:32:58.6563034Z torch.manual_seed(2025) 2025-05-07T20:32:58.6563273Z 2025-05-07T20:32:58.6563611Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.6563954Z 2025-05-07T20:32:58.6564151Z x_sign = torch.sign(x) 2025-05-07T20:32:58.6564441Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.6564749Z x = x_sign * x_clamp 2025-05-07T20:32:58.6564995Z x0 = x[:, :D] 2025-05-07T20:32:58.6565208Z x1 = x[:, D:] 2025-05-07T20:32:58.6565418Z 2025-05-07T20:32:58.6565606Z if contiguous: 2025-05-07T20:32:58.6565837Z x0 = x0.contiguous() 2025-05-07T20:32:58.6566098Z x1 = x1.contiguous() 2025-05-07T20:32:58.6566341Z 2025-05-07T20:32:58.6566536Z if scale_ub is not None: 2025-05-07T20:32:58.6566810Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.6567145Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.6567445Z ) 2025-05-07T20:32:58.6567643Z else: 2025-05-07T20:32:58.6567856Z scale_ub_tensor = None 2025-05-07T20:32:58.6568106Z 2025-05-07T20:32:58.6568336Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.6568730Z op = silu_mul_quant 2025-05-07T20:32:58.6568982Z if compiled: 2025-05-07T20:32:58.6569230Z op = torch.compile(op) 2025-05-07T20:32:58.6569524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.6569794Z 2025-05-07T20:32:58.6569985Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.6570151Z 2025-05-07T20:32:58.6570250Z moe/activation_test.py:117: 2025-05-07T20:32:58.6570546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.6570874Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.6571154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.6571713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.6572269Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.6572926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.6573615Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.6574147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.6574818Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.6575475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.6576083Z kernel = self.compile( 2025-05-07T20:32:58.6576664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.6577310Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.6577706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.6577934Z 2025-05-07T20:32:58.6578149Z self = 2025-05-07T20:32:58.6579221Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.6580682Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f313308b7f0>} 2025-05-07T20:32:58.6582022Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.6583041Z context = 2025-05-07T20:32:58.6583323Z 2025-05-07T20:32:58.6583544Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.6584065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.6584529Z module_map=module_map) 2025-05-07T20:32:58.6584897Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.6585249Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.6585505Z E ^ 2025-05-07T20:32:58.6585970Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.6586423Z 2025-05-07T20:32:58.6586846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.6587354Z 2025-05-07T20:32:58.6587464Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.6587873Z self=, 2025-05-07T20:32:58.6594722Z T=128, 2025-05-07T20:32:58.6594941Z D=7168, 2025-05-07T20:32:58.6595144Z scale_ub=1200.0, 2025-05-07T20:32:58.6595501Z contiguous=True, 2025-05-07T20:32:58.6595737Z compiled=False, 2025-05-07T20:32:58.6595959Z ) 2025-05-07T20:32:58.6596286Z self = 2025-05-07T20:32:58.6596791Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.6597068Z 2025-05-07T20:32:58.6597158Z @given( 2025-05-07T20:32:58.6597397Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.6597727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.6598051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.6598389Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.6598724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.6599016Z ) 2025-05-07T20:32:58.6599376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.6599820Z def test_silu_mul_quant( 2025-05-07T20:32:58.6600072Z self, 2025-05-07T20:32:58.6600281Z T: int, 2025-05-07T20:32:58.6600480Z D: int, 2025-05-07T20:32:58.6600709Z scale_ub: Optional[float], 2025-05-07T20:32:58.6600987Z contiguous: bool, 2025-05-07T20:32:58.6601225Z compiled: bool, 2025-05-07T20:32:58.6601457Z ) -> None: 2025-05-07T20:32:58.6601678Z torch.manual_seed(2025) 2025-05-07T20:32:58.6601920Z 2025-05-07T20:32:58.6602200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.6602626Z 2025-05-07T20:32:58.6602825Z x_sign = torch.sign(x) 2025-05-07T20:32:58.6603195Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.6605198Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
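[Editor's note] Here the OOM has moved from `torch.randn` to the `torch.clamp` line: the input itself fit, but each of `sign`, `abs`, and `clamp` materializes another full `[T, 2*D]` temporary on top of it. A slightly leaner equivalent (editor's sketch; same values, one fewer temporary):

```python
import torch

def make_input(T: int, D: int) -> torch.Tensor:
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    # clamp_ reuses the buffer produced by abs(), saving one temporary
    # relative to the test's sign/abs/clamp sequence.
    return torch.sign(x) * torch.abs(x).clamp_(0.01, 2.0)
```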
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.6607048Z 2025-05-07T20:32:58.6607176Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:58.6607395Z 2025-05-07T20:32:58.6607508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.6607923Z self=, 2025-05-07T20:32:58.6608364Z T=128, 2025-05-07T20:32:58.6608585Z D=5120, 2025-05-07T20:32:58.6608778Z scale_ub=1200.0, 2025-05-07T20:32:58.6609007Z contiguous=True, 2025-05-07T20:32:58.6609244Z compiled=True, 2025-05-07T20:32:58.6609448Z ) 2025-05-07T20:32:58.6609835Z self = 2025-05-07T20:32:58.6610323Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:58.6610593Z 2025-05-07T20:32:58.6610683Z @given( 2025-05-07T20:32:58.6610913Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.6611231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.6611546Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.6611874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.6612214Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.6612504Z ) 2025-05-07T20:32:58.6612855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.6613300Z def test_silu_mul_quant( 2025-05-07T20:32:58.6613545Z self, 2025-05-07T20:32:58.6613747Z T: int, 2025-05-07T20:32:58.6613945Z D: int, 2025-05-07T20:32:58.6614169Z scale_ub: Optional[float], 2025-05-07T20:32:58.6614447Z contiguous: bool, 2025-05-07T20:32:58.6614691Z compiled: bool, 2025-05-07T20:32:58.6614980Z ) -> None: 2025-05-07T20:32:58.6615202Z torch.manual_seed(2025) 2025-05-07T20:32:58.6615441Z 2025-05-07T20:32:58.6615733Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.6616074Z 2025-05-07T20:32:58.6616270Z x_sign = torch.sign(x) 2025-05-07T20:32:58.6616564Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.6618547Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.6620466Z 2025-05-07T20:32:58.6620590Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:58.6620802Z 2025-05-07T20:32:58.6620912Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.6621322Z self=, 2025-05-07T20:32:58.6621732Z T=128, 2025-05-07T20:32:58.6621924Z D=7168, 2025-05-07T20:32:58.6622122Z scale_ub=None, 2025-05-07T20:32:58.6622335Z contiguous=True, 2025-05-07T20:32:58.6622614Z compiled=True, 2025-05-07T20:32:58.6622822Z ) 2025-05-07T20:32:58.8671756Z self = 2025-05-07T20:32:58.8672284Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.8672544Z 2025-05-07T20:32:58.8672625Z @given( 2025-05-07T20:32:58.8672858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8673164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.8673480Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.8673809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.8674130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.8674419Z ) 2025-05-07T20:32:58.8674770Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.8675206Z def test_silu_mul_quant( 2025-05-07T20:32:58.8675447Z self, 2025-05-07T20:32:58.8675644Z T: int, 2025-05-07T20:32:58.8675847Z D: int, 2025-05-07T20:32:58.8676061Z scale_ub: Optional[float], 2025-05-07T20:32:58.8676337Z contiguous: bool, 2025-05-07T20:32:58.8676579Z compiled: bool, 2025-05-07T20:32:58.8676803Z ) -> None: 2025-05-07T20:32:58.8677020Z torch.manual_seed(2025) 2025-05-07T20:32:58.8677264Z 2025-05-07T20:32:58.8677598Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8679620Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
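[Editor's note] Failure 4 in the summary that follows goes through the test's `ref_fn` (shown in full a little further down), which dequantizes with `y_fp8.to(torch.float32) * y_scale[:, None]`; that pins down the contract of `triton_quantize_fp8_row`: per-row scales, recovered by multiplication. A plain-PyTorch sketch of that contract; the exact `scale_ub` handling (capping the per-row max) and the e4m3 limit of 448 are the editor's assumptions, not FBGEMM's code:

```python
import torch

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None):
    FP8_MAX = 448.0  # max finite value of float8_e4m3fn
    row_max = y.abs().amax(dim=1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumed cap, see note above
    inv_scale = FP8_MAX / row_max.clamp(min=1e-12)
    y_fp8 = (y.to(torch.float32) * inv_scale).clamp(-FP8_MAX, FP8_MAX)
    # Dequantize as the test does: y_fp8.to(torch.float32) * y_scale[:, None]
    return y_fp8.to(torch.float8_e4m3fn), (1.0 / inv_scale).squeeze(1)
```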
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.8681458Z 2025-05-07T20:32:58.8681580Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.8681791Z 2025-05-07T20:32:58.8693361Z FAILED 2025-05-07T20:32:58.8693583Z 2025-05-07T20:32:58.8693849Z =================================== FAILURES =================================== 2025-05-07T20:32:58.8694490Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:58.8695113Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:58.8696195Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:58.8696954Z | yield 2025-05-07T20:32:58.8697539Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:58.8698493Z | self._callTestMethod(testMethod) 2025-05-07T20:32:58.8699270Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:58.8700108Z | method() 2025-05-07T20:32:58.8700995Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:58.8702192Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8703085Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:58.8703967Z | raise the_error_hypothesis_found 2025-05-07T20:32:58.8704657Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:58.8705327Z +-+---------------- 1 ---------------- 2025-05-07T20:32:58.8705725Z | Traceback (most recent call last): 2025-05-07T20:32:58.8706714Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:58.8707996Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8710868Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
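[Editor's note] The summary above shows why one test reports four failures at once: Hypothesis 6.x raises a single `exceptiongroup.ExceptionGroup` covering every distinct falsifying example (the backport package, since this job runs Python 3.10 and native exception groups arrived in 3.11). A small sketch of splitting such a group by failure type (`handle` is the editor's name):

```python
import torch
from exceptiongroup import ExceptionGroup  # backport; built into Python 3.11+

def handle(eg: ExceptionGroup) -> None:
    # split() partitions the sub-exceptions into (matching, non_matching);
    # here that separates the three OOMs from the Triton CompilationError.
    oom, rest = eg.split(torch.OutOfMemoryError)
    ...
```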
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.8713614Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:58.8714224Z | self=, 2025-05-07T20:32:58.8714778Z | T=2048, 2025-05-07T20:32:58.8715092Z | D=5120, # or any other generated value 2025-05-07T20:32:58.8715562Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:58.8716059Z | contiguous=True, # or any other generated value 2025-05-07T20:32:58.8716598Z | compiled=False, # or any other generated value 2025-05-07T20:32:58.8717014Z | ) 2025-05-07T20:32:58.8717265Z | 2025-05-07T20:32:58.8718127Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:58.8718960Z +---------------- 2 ---------------- 2025-05-07T20:32:58.8719377Z | Traceback (most recent call last): 2025-05-07T20:32:58.8720390Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:58.8721454Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8724270Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.8727041Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:58.8727650Z | self=, 2025-05-07T20:32:58.8728222Z | T=128, 2025-05-07T20:32:58.8728504Z | D=7168, 2025-05-07T20:32:58.8728779Z | scale_ub=None, 2025-05-07T20:32:58.8729107Z | contiguous=True, 2025-05-07T20:32:58.8729475Z | compiled=True, 2025-05-07T20:32:58.8729790Z | ) 2025-05-07T20:32:58.8730035Z | 2025-05-07T20:32:58.8730766Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:58.8731603Z +---------------- 3 ---------------- 2025-05-07T20:32:58.8732001Z | Traceback (most recent call last): 2025-05-07T20:32:58.8732850Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:58.8733615Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8735679Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
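[Editor's note] Each sub-exception ends with a ready-made replay recipe. To rerun failure 1 deterministically, stack `@reproduce_failure` on top of the existing `@given` (version string and payload copied verbatim from the report above); the strategies must stay identical for the blob to decode:

```python
from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_repro(T, D, scale_ub, contiguous, compiled):
    ...  # body of test_silu_mul_quant, unchanged
```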
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.8737653Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:58.8738085Z | self=, 2025-05-07T20:32:58.8738661Z | T=128, 2025-05-07T20:32:58.8738944Z | D=5120, 2025-05-07T20:32:58.8739239Z | scale_ub=1200.0, 2025-05-07T20:32:58.8739585Z | contiguous=True, 2025-05-07T20:32:58.8740038Z | compiled=True, 2025-05-07T20:32:58.8741931Z | ) 2025-05-07T20:32:58.8742197Z | 2025-05-07T20:32:58.8742937Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:58.8743792Z +---------------- 4 ---------------- 2025-05-07T20:32:58.8744217Z | Traceback (most recent call last): 2025-05-07T20:32:58.8745260Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:58.8746314Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:58.8747294Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:58.8748028Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8749220Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:58.8750361Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.8751238Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:58.8752312Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.8755158Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:58.8756266Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8757439Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:58.8758785Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8759888Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:58.8760840Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.8761759Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:58.8762554Z | fn() 2025-05-07T20:32:58.8763348Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:58.8764222Z | self.fn.run( 2025-05-07T20:32:58.8764954Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:58.8765756Z | kernel = self.compile( 2025-05-07T20:32:58.8766603Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:58.8767577Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.8768561Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:58.8769731Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.8770548Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.8773987Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.8774382Z | ^ 2025-05-07T20:32:58.8775019Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.8775812Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:58.8776361Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:58.8777071Z | self=, 2025-05-07T20:32:58.8777658Z | T=1, # or any other generated value 2025-05-07T20:32:58.8778092Z | D=5120, # or any other generated value 2025-05-07T20:32:58.8778568Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:58.8779067Z | contiguous=True, # or any other generated value 2025-05-07T20:32:58.8779627Z | compiled=True, # or any other generated value 2025-05-07T20:32:58.8780207Z | ) 2025-05-07T20:32:58.8780461Z | 2025-05-07T20:32:58.8781178Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:58.8782016Z +------------------------------------ 2025-05-07T20:32:58.8782593Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:58.8783121Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.8783685Z self=, 2025-05-07T20:32:58.8784242Z T=1, 2025-05-07T20:32:58.8784493Z D=5120, 2025-05-07T20:32:58.8784759Z scale_ub=None, 2025-05-07T20:32:58.8785056Z contiguous=True, 2025-05-07T20:32:58.8785354Z compiled=True, 2025-05-07T20:32:58.8785644Z ) 2025-05-07T20:32:58.8786087Z self = 2025-05-07T20:32:58.8786761Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.8787122Z 2025-05-07T20:32:58.8787229Z @given( 2025-05-07T20:32:58.8787548Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8787981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.8788447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.8788919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.8789460Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.8790136Z ) 2025-05-07T20:32:58.8790636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.8791235Z def test_silu_mul_quant( 2025-05-07T20:32:58.8791561Z self, 2025-05-07T20:32:58.8791818Z T: int, 2025-05-07T20:32:58.8792091Z D: int, 2025-05-07T20:32:58.8792400Z scale_ub: Optional[float], 2025-05-07T20:32:58.8792768Z contiguous: bool, 2025-05-07T20:32:58.8793109Z compiled: bool, 2025-05-07T20:32:58.8793429Z ) -> None: 2025-05-07T20:32:58.8793724Z torch.manual_seed(2025) 2025-05-07T20:32:58.8794067Z 2025-05-07T20:32:58.8794445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8794900Z 2025-05-07T20:32:58.8795160Z x_sign = torch.sign(x) 2025-05-07T20:32:58.8795558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.8795977Z x = x_sign * x_clamp 2025-05-07T20:32:58.8796318Z x0 = x[:, :D] 2025-05-07T20:32:58.8796619Z x1 = x[:, D:] 2025-05-07T20:32:58.8796903Z 2025-05-07T20:32:58.8797159Z if contiguous: 2025-05-07T20:32:58.8797481Z x0 = x0.contiguous() 
2025-05-07T20:32:58.8797835Z x1 = x1.contiguous() 2025-05-07T20:32:58.8798171Z 2025-05-07T20:32:58.8798438Z if scale_ub is not None: 2025-05-07T20:32:58.8799009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.8799526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.8799943Z ) 2025-05-07T20:32:58.8800205Z else: 2025-05-07T20:32:58.8800482Z scale_ub_tensor = None 2025-05-07T20:32:58.8800829Z 2025-05-07T20:32:58.8801141Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8801554Z op = silu_mul_quant 2025-05-07T20:32:58.8801878Z if compiled: 2025-05-07T20:32:58.8802201Z op = torch.compile(op) 2025-05-07T20:32:58.8802575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.8802924Z 2025-05-07T20:32:58.8803177Z y_fp8, y_scale = fn() 2025-05-07T20:32:58.8803541Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:58.8803911Z 2025-05-07T20:32:58.8804217Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8804653Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:58.8805020Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:58.8805425Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:58.8805879Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8806268Z 2025-05-07T20:32:58.8806522Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:58.8806774Z 2025-05-07T20:32:58.8807048Z moe/activation_test.py:126: 2025-05-07T20:32:58.8807431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8807861Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:58.8808280Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8809290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:58.8810274Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.8811031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.8811963Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.8812905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:58.8813881Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8814901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:58.8816003Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8816991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:58.8817848Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.8818652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:58.8819327Z fn() 2025-05-07T20:32:58.8820087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:58.8820841Z self.fn.run( 2025-05-07T20:32:58.8821454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.8822141Z kernel = self.compile( 2025-05-07T20:32:58.8822838Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.8823678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.8824182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8824475Z 2025-05-07T20:32:58.8824736Z self = 2025-05-07T20:32:58.8826271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.8828162Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32572a4af0>} 2025-05-07T20:32:58.8829946Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.8831291Z context = 2025-05-07T20:32:58.8831660Z 2025-05-07T20:32:58.8831877Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.8832553Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.8833175Z module_map=module_map) 2025-05-07T20:32:58.8833656Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.8834142Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.8834504Z E ^ 2025-05-07T20:32:58.8835173Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.8835803Z 2025-05-07T20:32:58.8836376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.8837089Z 2025-05-07T20:32:58.8837227Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.8837785Z self=, 2025-05-07T20:32:58.8838324Z T=2048, 2025-05-07T20:32:58.8838570Z D=5120, 2025-05-07T20:32:58.8838819Z scale_ub=1200.0, 2025-05-07T20:32:58.8839107Z contiguous=True, 2025-05-07T20:32:58.8839412Z compiled=False, 2025-05-07T20:32:58.8839686Z ) 2025-05-07T20:32:58.8840109Z self = 2025-05-07T20:32:58.8840762Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.8841138Z 2025-05-07T20:32:58.8841237Z @given( 2025-05-07T20:32:58.8841532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8841962Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.8842453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.8842928Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.8843361Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.8843740Z ) 2025-05-07T20:32:58.8844203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.8844807Z def test_silu_mul_quant( 2025-05-07T20:32:58.8845138Z self, 2025-05-07T20:32:58.8845399Z T: int, 2025-05-07T20:32:58.8865876Z D: int, 2025-05-07T20:32:58.8866174Z scale_ub: Optional[float], 2025-05-07T20:32:58.8866525Z contiguous: bool, 2025-05-07T20:32:58.8866831Z compiled: bool, 2025-05-07T20:32:58.8867122Z ) -> None: 2025-05-07T20:32:58.8867392Z torch.manual_seed(2025) 2025-05-07T20:32:58.8867706Z 2025-05-07T20:32:58.8868074Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8868538Z 2025-05-07T20:32:58.8868792Z x_sign = torch.sign(x) 2025-05-07T20:32:58.8869179Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.8869592Z x = x_sign * x_clamp 2025-05-07T20:32:58.8869904Z x0 = x[:, :D] 
2025-05-07T20:32:58.8870194Z x1 = x[:, D:] 2025-05-07T20:32:58.8870470Z 2025-05-07T20:32:58.8870711Z if contiguous: 2025-05-07T20:32:58.8871018Z x0 = x0.contiguous() 2025-05-07T20:32:58.8871471Z x1 = x1.contiguous() 2025-05-07T20:32:58.8871782Z 2025-05-07T20:32:58.8872046Z if scale_ub is not None: 2025-05-07T20:32:58.8872476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.8872917Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.8873335Z ) 2025-05-07T20:32:58.8873597Z else: 2025-05-07T20:32:58.8873882Z scale_ub_tensor = None 2025-05-07T20:32:58.8874207Z 2025-05-07T20:32:58.8874522Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8874945Z op = silu_mul_quant 2025-05-07T20:32:58.8875273Z if compiled: 2025-05-07T20:32:58.8875602Z op = torch.compile(op) 2025-05-07T20:32:58.8876000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.8876369Z 2025-05-07T20:32:58.8876623Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.8876839Z 2025-05-07T20:32:58.8876983Z moe/activation_test.py:117: 2025-05-07T20:32:58.8877380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8877835Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.8878219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.8879178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.8880090Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.8880867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.8881792Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.8882694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.8883427Z kernel = self.compile( 2025-05-07T20:32:58.8884175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.8885098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.8885648Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8885969Z 2025-05-07T20:32:58.8886240Z self = 2025-05-07T20:32:58.8887702Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.8889711Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3257181990>} 2025-05-07T20:32:58.8891831Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.8893264Z context = 2025-05-07T20:32:58.8893647Z 2025-05-07T20:32:58.8893871Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.8894565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.8895194Z module_map=module_map) 2025-05-07T20:32:58.8895679Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.8896162Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.8896479Z E ^ 2025-05-07T20:32:58.8897112Z E ValueError("type fp8e4nv not supported in this architecture. 

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
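Every failure in this run has the same root cause: Triton's fp8e4nv type is FP8 E4M3 in the NVIDIA-native encoding, which Triton only lowers on GPUs of compute capability 8.9 or newer (Ada/Hopper); on older parts only 'fp8e4b15' and 'fp8e5' are available, exactly as the ValueError reports. A minimal sketch of a capability guard that would skip such cases instead of erroring; the helper and the test-class name here are illustrative, not part of moe/activation_test.py:

    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on NVIDIA GPUs with
        # compute capability >= 8.9 (Ada / Hopper and newer).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(
        not _supports_fp8e4nv(),
        "fp8e4nv requires an NVIDIA GPU with compute capability >= 8.9",
    )
    class Fp8ActivationTest(unittest.TestCase):
        # FP8 test bodies such as test_silu_mul_quant would live here and be
        # skipped, rather than fail, on unsupported architectures.
        pass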
Hypothesis went on to retry the test with further examples; every one failed with the identical CompilationError from triton/compiler/compiler.py:100, wrapping ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The failure hits one of two Triton kernels: _fbgemm_silu_mul_quant, reached from fn() at moe/activation_test.py:117 via fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, or _kernel_quantize_fp8_row, reached from ref_fn() at moe/activation_test.py:126 via triton_quantize_fp8_row (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370). The examples tried, and the kernel whose compilation failed:

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=True)  -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=True,  compiled=True)  -> _kernel_quantize_fp8_row
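The reference path's row-wise quantization has simple semantics: scale each row so its largest magnitude maps to the FP8 E4M3 maximum, optionally capping the row maximum at scale_ub first, and return the FP8 payload together with the per-row dequantization scale (the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]). A pure-PyTorch sketch of that behavior; the exact clamping and epsilon details inside fbgemm's triton_quantize_fp8_row are assumptions here:

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value in torch.float8_e4m3fn


    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row maximum magnitude, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Per-row dequantization scale; guard against all-zero rows.
        y_scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (
            (y / y_scale[:, None])
            .clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
            .to(torch.float8_e4m3fn)
        )
        return y_fp8, y_scale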
The failure site tracks the compiled flag exactly: every compiled=False example fails inside fn() while compiling _fbgemm_silu_mul_quant, while every compiled=True example gets past fn() and fails inside the eager reference path ref_fn(), during autotuning of _kernel_quantize_fp8_row.
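The failure is also reproducible outside Hypothesis in a couple of lines; this sketch assumes the same environment as the run above (an fbgemm_gpu genai build and a CUDA device) and uses the two-argument call shape seen in the tracebacks:

    import torch

    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    # On a GPU without fp8e4nv support this raises
    # triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ..."),
    # as in the runs above.
    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)  # scale_ub=None, as in the test
    print(y_fp8.dtype, y_scale.shape)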
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    (test body identical to the listing above)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0,
2025-05-07T20:32:58.9155060Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:58.9155568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254f24280>}
2025-05-07T20:32:58.9156304Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:58.9156501Z context =
2025-05-07T20:32:58.9156507Z 
2025-05-07T20:32:58.9156671Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:58.9156937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:58.9169013Z module_map=module_map)
2025-05-07T20:32:58.9169226Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9169331Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.9169503Z E ^
2025-05-07T20:32:58.9169909Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9169915Z 
2025-05-07T20:32:58.9170335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9170340Z 
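Both failures above, and every retry below, hit the same point: Triton rejects the fp8e4nv (FP8 E4M3) element type while compiling the quantization kernels, and the ValueError lists fp8e4b15 and fp8e5 as the only FP8 dtypes this architecture supports. That combination indicates a GPU older than compute capability 8.9 (Ada/Hopper), the first generations for which Triton can emit fp8e4nv. A capability check would turn these hard failures into skips; the sketch below is illustrative only (the helper and class names are assumptions, not FBGEMM code):

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Triton can only compile fp8e4nv (FP8 E4M3) for SM 8.9+ GPUs; older
        # parts expose just fp8e4b15/fp8e5, as the ValueError above reports.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not gpu_supports_fp8e4nv(), "requires SM 8.9+ for Triton fp8e4nv")
    class SiluMulQuantTest(unittest.TestCase):  # hypothetical placement
        ...

With such a guard in place, Hypothesis would not keep retrying every parameter combination against the same CompilationError, as it does in the remaining examples below.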
2025-05-07T20:32:58.9170457Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.9170687Z self=,
2025-05-07T20:32:58.9170771Z T=4096,
2025-05-07T20:32:58.9170861Z D=5120,
2025-05-07T20:32:58.9170946Z scale_ub=None,
2025-05-07T20:32:58.9171034Z contiguous=True,
2025-05-07T20:32:58.9171127Z compiled=True,
2025-05-07T20:32:58.9171209Z )
2025-05-07T20:32:58.9177452Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:58.9177557Z moe/activation_test.py:126:
2025-05-07T20:32:58.9188386Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9188548Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.9188628Z E ^
2025-05-07T20:32:58.9189028Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9189463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9189581Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.9190037Z self=,
2025-05-07T20:32:58.9190157Z T=16384,
2025-05-07T20:32:58.9190269Z D=5120,
2025-05-07T20:32:58.9190398Z scale_ub=None,
2025-05-07T20:32:58.9190492Z contiguous=True,
2025-05-07T20:32:58.9190579Z compiled=True,
2025-05-07T20:32:58.9190668Z )
2025-05-07T20:32:58.9197010Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:58.9197116Z moe/activation_test.py:126:
2025-05-07T20:32:58.9206091Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9206200Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.9206280Z E ^
2025-05-07T20:32:58.9206642Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9207068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9207190Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.9207412Z self=,
2025-05-07T20:32:58.9207490Z T=1,
2025-05-07T20:32:58.9207576Z D=5120,
2025-05-07T20:32:58.9207662Z scale_ub=1200.0,
2025-05-07T20:32:58.9207750Z contiguous=True,
2025-05-07T20:32:58.9207843Z compiled=True,
2025-05-07T20:32:58.9207922Z )
2025-05-07T20:32:58.9212933Z > y_fp8, y_scale = fn()
2025-05-07T20:32:58.9213045Z moe/activation_test.py:117:
2025-05-07T20:32:58.9219623Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9219723Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:58.9219909Z E ^
2025-05-07T20:32:58.9220264Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9220694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9220811Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.9221034Z self=,
2025-05-07T20:32:58.9221119Z T=1,
2025-05-07T20:32:58.9221196Z D=5120,
2025-05-07T20:32:58.9221280Z scale_ub=None,
2025-05-07T20:32:58.9221379Z contiguous=False,
2025-05-07T20:32:58.9221466Z compiled=True,
2025-05-07T20:32:58.9221544Z )
2025-05-07T20:32:58.9227638Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:58.9227747Z moe/activation_test.py:126:
2025-05-07T20:32:58.9236830Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9236937Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.9237017Z E ^
2025-05-07T20:32:58.9237380Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9237802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
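The examples above all exercise the same contract, visible in the test body earlier in this log: silu_mul_quant computes y = x0 * sigmoid(x0) * x1 in higher precision, quantizes each row to FP8, and returns the FP8 tensor together with one scale per row, so that y_fp8.to(torch.float32) * y_scale[:, None] approximately reconstructs y. A pure-PyTorch sketch of that row-wise convention follows; it is a reference illustration, not FBGEMM's triton_quantize_fp8_row, and the scale_ub handling is an assumption inferred from the argument name:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max magnitude maps to FP8_MAX.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            # Assumed semantics: cap the per-row max at scale_ub before scaling.
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] then recovers y up to FP8 rounding, which is exactly the comparison the failing test would perform once the kernels compile.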
at 0x7f3254f34a60>} 2025-05-07T20:32:58.9235865Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9236057Z context = 2025-05-07T20:32:58.9236062Z 2025-05-07T20:32:58.9236240Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9236546Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9236658Z module_map=module_map) 2025-05-07T20:32:58.9236830Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9236937Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.9237017Z E ^ 2025-05-07T20:32:58.9237380Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9237387Z 2025-05-07T20:32:58.9237802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9237807Z 2025-05-07T20:32:58.9237925Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9238147Z self=, 2025-05-07T20:32:58.9238227Z T=1, 2025-05-07T20:32:58.9238319Z D=5120, 2025-05-07T20:32:58.9238407Z scale_ub=None, 2025-05-07T20:32:58.9238496Z contiguous=True, 2025-05-07T20:32:58.9238591Z compiled=False, 2025-05-07T20:32:58.9238666Z ) 2025-05-07T20:32:58.9238883Z self = 2025-05-07T20:32:58.9239056Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.9239061Z 2025-05-07T20:32:58.9239141Z @given( 2025-05-07T20:32:58.9239267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9239420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9239581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9239712Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9239828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9239909Z ) 2025-05-07T20:32:58.9240162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9240263Z def test_silu_mul_quant( 2025-05-07T20:32:58.9240354Z self, 2025-05-07T20:32:58.9240432Z T: int, 2025-05-07T20:32:58.9240512Z D: int, 2025-05-07T20:32:58.9240621Z scale_ub: Optional[float], 2025-05-07T20:32:58.9240713Z contiguous: bool, 2025-05-07T20:32:58.9240800Z compiled: bool, 2025-05-07T20:32:58.9240886Z ) -> None: 2025-05-07T20:32:58.9240984Z torch.manual_seed(2025) 2025-05-07T20:32:58.9241058Z 2025-05-07T20:32:58.9241232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9241310Z 2025-05-07T20:32:58.9241406Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9241542Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9241633Z x = x_sign * x_clamp 2025-05-07T20:32:58.9241716Z x0 = x[:, :D] 2025-05-07T20:32:58.9241804Z x1 = x[:, D:] 2025-05-07T20:32:58.9241878Z 2025-05-07T20:32:58.9242015Z if contiguous: 2025-05-07T20:32:58.9242111Z x0 = x0.contiguous() 2025-05-07T20:32:58.9242205Z x1 = x1.contiguous() 2025-05-07T20:32:58.9242291Z 2025-05-07T20:32:58.9242385Z if scale_ub is not None: 2025-05-07T20:32:58.9242492Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9242639Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9242717Z ) 2025-05-07T20:32:58.9242797Z else: 2025-05-07T20:32:58.9242901Z scale_ub_tensor = None 2025-05-07T20:32:58.9242980Z 2025-05-07T20:32:58.9243112Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9243213Z op = silu_mul_quant 2025-05-07T20:32:58.9243301Z if compiled: 2025-05-07T20:32:58.9243411Z 
op = torch.compile(op) 2025-05-07T20:32:58.9243518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9243592Z 2025-05-07T20:32:58.9243694Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9243701Z 2025-05-07T20:32:58.9243801Z moe/activation_test.py:117: 2025-05-07T20:32:58.9243976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9244088Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9244189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9244687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9244794Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9245157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9245389Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9245734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9245831Z kernel = self.compile( 2025-05-07T20:32:58.9246221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9246401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9246535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9246539Z 2025-05-07T20:32:58.9246748Z self = 2025-05-07T20:32:58.9247562Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9248101Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3255071e10>} 2025-05-07T20:32:58.9248844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9249046Z context = 2025-05-07T20:32:58.9249051Z 2025-05-07T20:32:58.9249218Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9249481Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9249600Z module_map=module_map) 2025-05-07T20:32:58.9249767Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9249877Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9249957Z E ^ 2025-05-07T20:32:58.9250311Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9250315Z 2025-05-07T20:32:58.9250789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9250797Z 2025-05-07T20:32:58.9250904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9251133Z self=, 2025-05-07T20:32:58.9251215Z T=128, 2025-05-07T20:32:58.9251293Z D=5120, 2025-05-07T20:32:58.9251388Z scale_ub=None, 2025-05-07T20:32:58.9251478Z contiguous=False, 2025-05-07T20:32:58.9251564Z compiled=True, 2025-05-07T20:32:58.9251653Z ) 2025-05-07T20:32:58.9251870Z self = 2025-05-07T20:32:58.9252048Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.9252053Z 2025-05-07T20:32:58.9252143Z @given( 2025-05-07T20:32:58.9252265Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9252372Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9252493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9252659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9252783Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9252858Z ) 2025-05-07T20:32:58.9253111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9253215Z def test_silu_mul_quant( 2025-05-07T20:32:58.9253293Z self, 2025-05-07T20:32:58.9253373Z T: int, 2025-05-07T20:32:58.9253465Z D: int, 2025-05-07T20:32:58.9253571Z scale_ub: Optional[float], 2025-05-07T20:32:58.9253663Z contiguous: bool, 2025-05-07T20:32:58.9253762Z compiled: bool, 2025-05-07T20:32:58.9253844Z ) -> None: 2025-05-07T20:32:58.9253950Z torch.manual_seed(2025) 2025-05-07T20:32:58.9254026Z 2025-05-07T20:32:58.9254197Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9254282Z 2025-05-07T20:32:58.9254379Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9254511Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9254612Z x = x_sign * x_clamp 2025-05-07T20:32:58.9254694Z x0 = x[:, :D] 2025-05-07T20:32:58.9254778Z x1 = x[:, D:] 2025-05-07T20:32:58.9254860Z 2025-05-07T20:32:58.9254945Z if contiguous: 2025-05-07T20:32:58.9255038Z x0 = x0.contiguous() 2025-05-07T20:32:58.9255134Z x1 = x1.contiguous() 2025-05-07T20:32:58.9255207Z 2025-05-07T20:32:58.9255299Z if scale_ub is not None: 2025-05-07T20:32:58.9255460Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9255638Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9255725Z ) 2025-05-07T20:32:58.9255805Z else: 2025-05-07T20:32:58.9255901Z scale_ub_tensor = None 2025-05-07T20:32:58.9255983Z 2025-05-07T20:32:58.9256113Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9256206Z op = silu_mul_quant 2025-05-07T20:32:58.9256302Z if compiled: 2025-05-07T20:32:58.9256404Z op = torch.compile(op) 2025-05-07T20:32:58.9256513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9256594Z 2025-05-07T20:32:58.9256688Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9256692Z 2025-05-07T20:32:58.9256797Z moe/activation_test.py:117: 2025-05-07T20:32:58.9256925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9257032Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9257140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9257511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9257606Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9258199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9258301Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9258672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9258895Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9259240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9259339Z kernel = self.compile( 2025-05-07T20:32:58.9259728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9260010Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9260145Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9260150Z 2025-05-07T20:32:58.9260359Z self = 2025-05-07T20:32:58.9261138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9261708Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3255070310>} 2025-05-07T20:32:58.9262454Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9262648Z context = 2025-05-07T20:32:58.9262653Z 2025-05-07T20:32:58.9262821Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9263095Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9263204Z module_map=module_map) 2025-05-07T20:32:58.9263378Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9263478Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9263559Z E ^ 2025-05-07T20:32:58.9263919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9263924Z 2025-05-07T20:32:58.9264342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9264390Z 2025-05-07T20:32:58.9264537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9264768Z self=, 2025-05-07T20:32:58.9264847Z T=128, 2025-05-07T20:32:58.9264933Z D=7168, 2025-05-07T20:32:58.9265018Z scale_ub=1200.0, 2025-05-07T20:32:58.9265108Z contiguous=False, 2025-05-07T20:32:58.9265203Z compiled=False, 2025-05-07T20:32:58.9265282Z ) 2025-05-07T20:32:58.9265500Z self = 2025-05-07T20:32:58.9265682Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:58.9265686Z 2025-05-07T20:32:58.9265765Z @given( 2025-05-07T20:32:58.9265886Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9265992Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9266110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9266240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9266357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9266436Z ) 2025-05-07T20:32:58.9266694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9266790Z def test_silu_mul_quant( 2025-05-07T20:32:58.9266869Z self, 2025-05-07T20:32:58.9267000Z T: int, 2025-05-07T20:32:58.9267080Z D: int, 2025-05-07T20:32:58.9267186Z scale_ub: Optional[float], 2025-05-07T20:32:58.9267283Z contiguous: bool, 2025-05-07T20:32:58.9267371Z compiled: bool, 2025-05-07T20:32:58.9267453Z ) -> None: 2025-05-07T20:32:58.9267554Z torch.manual_seed(2025) 2025-05-07T20:32:58.9267629Z 2025-05-07T20:32:58.9267805Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9267881Z 2025-05-07T20:32:58.9267976Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9268114Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9268208Z x = x_sign * x_clamp 2025-05-07T20:32:58.9268291Z x0 = x[:, :D] 2025-05-07T20:32:58.9268379Z x1 = x[:, D:] 2025-05-07T20:32:58.9268452Z 2025-05-07T20:32:58.9268541Z if contiguous: 2025-05-07T20:32:58.9268642Z x0 = x0.contiguous() 2025-05-07T20:32:58.9268732Z x1 = x1.contiguous() 2025-05-07T20:32:58.9268809Z 2025-05-07T20:32:58.9268960Z if scale_ub is not None: 2025-05-07T20:32:58.9269067Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9269211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9269290Z ) 2025-05-07T20:32:58.9269368Z else: 2025-05-07T20:32:58.9269469Z scale_ub_tensor = None 2025-05-07T20:32:58.9269544Z 2025-05-07T20:32:58.9269675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9269779Z op = silu_mul_quant 2025-05-07T20:32:58.9269866Z if compiled: 2025-05-07T20:32:58.9269970Z op = torch.compile(op) 2025-05-07T20:32:58.9270086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9270160Z 2025-05-07T20:32:58.9270253Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9270257Z 2025-05-07T20:32:58.9270364Z moe/activation_test.py:117: 2025-05-07T20:32:58.9270495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9270608Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9270710Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9271206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9271311Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9271670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9271944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9272334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9272433Z kernel = self.compile( 2025-05-07T20:32:58.9272821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9273002Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9273134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9273138Z 2025-05-07T20:32:58.9273348Z self = 2025-05-07T20:32:58.9274131Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9274640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254f35900>} 2025-05-07T20:32:58.9275433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9275625Z context = 2025-05-07T20:32:58.9275638Z 2025-05-07T20:32:58.9275809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9276077Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9276192Z module_map=module_map) 2025-05-07T20:32:58.9276354Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9276459Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9276549Z E ^ 2025-05-07T20:32:58.9276906Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9276910Z 2025-05-07T20:32:58.9277336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9277340Z 2025-05-07T20:32:58.9277449Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9277714Z self=, 2025-05-07T20:32:58.9277797Z T=128, 2025-05-07T20:32:58.9277877Z D=5120, 2025-05-07T20:32:58.9277961Z scale_ub=None, 2025-05-07T20:32:58.9278056Z contiguous=False, 2025-05-07T20:32:58.9278141Z compiled=False, 2025-05-07T20:32:58.9278218Z ) 2025-05-07T20:32:58.9278440Z self = 2025-05-07T20:32:58.9278616Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.9278621Z 2025-05-07T20:32:58.9278705Z @given( 2025-05-07T20:32:58.9278829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9278932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9279055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9279172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9279294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9279379Z ) 2025-05-07T20:32:58.9279630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9279736Z def test_silu_mul_quant( 2025-05-07T20:32:58.9279815Z self, 2025-05-07T20:32:58.9279893Z T: int, 2025-05-07T20:32:58.9279977Z D: int, 2025-05-07T20:32:58.9280077Z scale_ub: Optional[float], 2025-05-07T20:32:58.9280169Z contiguous: bool, 2025-05-07T20:32:58.9280315Z compiled: bool, 2025-05-07T20:32:58.9280395Z ) -> None: 2025-05-07T20:32:58.9280492Z torch.manual_seed(2025) 2025-05-07T20:32:58.9280616Z 2025-05-07T20:32:58.9280787Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9280865Z 2025-05-07T20:32:58.9280967Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9281095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9281190Z x = x_sign * x_clamp 2025-05-07T20:32:58.9281282Z x0 = x[:, :D] 2025-05-07T20:32:58.9281367Z x1 = x[:, D:] 2025-05-07T20:32:58.9281453Z 2025-05-07T20:32:58.9281539Z if contiguous: 2025-05-07T20:32:58.9281634Z x0 = x0.contiguous() 2025-05-07T20:32:58.9281731Z x1 = x1.contiguous() 2025-05-07T20:32:58.9281804Z 2025-05-07T20:32:58.9281895Z if scale_ub is not None: 2025-05-07T20:32:58.9282008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9282148Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9282226Z ) 2025-05-07T20:32:58.9282310Z else: 2025-05-07T20:32:58.9282409Z scale_ub_tensor = None 2025-05-07T20:32:58.9282487Z 2025-05-07T20:32:58.9282622Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9282714Z op = silu_mul_quant 2025-05-07T20:32:58.9282809Z if compiled: 2025-05-07T20:32:58.9282955Z op = torch.compile(op) 2025-05-07T20:32:58.9283064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9283148Z 2025-05-07T20:32:58.9283244Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9283248Z 2025-05-07T20:32:58.9283348Z moe/activation_test.py:117: 2025-05-07T20:32:58.9283483Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9283587Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9283686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9284204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9284305Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9284676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9284900Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9285243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9285386Z kernel = self.compile( 2025-05-07T20:32:58.9285774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9285961Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9286088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9286096Z 2025-05-07T20:32:58.9286304Z self = 2025-05-07T20:32:58.9287079Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9287578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254cf2a70>} 2025-05-07T20:32:58.9288359Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9288573Z context = 2025-05-07T20:32:58.9288578Z 2025-05-07T20:32:58.9288743Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9289093Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9289204Z module_map=module_map) 2025-05-07T20:32:58.9289376Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9289477Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9289555Z E ^ 2025-05-07T20:32:58.9290175Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:58.9290667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9290787Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) [same test body, traceback, and CompilationError as above: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:58.9309753Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) [same CompilationError; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 before reaching _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:58.9323430Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) [same test body and CompilationError]
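Every failure above has the same root cause: both _fbgemm_silu_mul_quant and the quantization kernels request Triton's fp8e4nv dtype (torch.float8_e4m3fn), and this runner's GPU rejects it at compile time. Below is a minimal sketch of a capability guard a test could use to skip these cases on such hardware; the helper name and the (8, 9) threshold are assumptions, not something the log states (fp8e4nv is generally available from compute capability 8.9, i.e. Ada/Hopper, while the A10G behind linux.g5.4xlarge reports 8.6):

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumed threshold: Triton lowers fp8e4nv (float8_e4m3fn) only on
        # compute capability >= 8.9. The A10G (8, 6) driving this job would
        # return False here, matching the ValueError above.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Hypothetical usage on the test above:
    # @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...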
2025-05-07T20:32:58.9336973Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:58.9338103Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[test body as above; this time fn() returns, and the failure surfaces in the reference path instead:]
2025-05-07T20:32:58.9342837Z         y_fp8, y_scale = fn()
2025-05-07T20:32:58.9342968Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:58.9343195Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:58.9343306Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:58.9343411Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:58.9343543Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:58.9343687Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:58.9343874Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:58.9344026Z moe/activation_test.py:126:
2025-05-07T20:32:58.9344195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:58.9344309Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:58.9344443Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:58.9345009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:58.9345115Z     _kernel_quantize_fp8_row[grid](
[Triton jit/autotuner frames as in the traceback above (jit.py:330, autotuner.py:186/166, testing.py:117, autotuner.py:152, jit.py:623, compiler.py:273), ending in:]
2025-05-07T20:32:58.9352950Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9353053Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.9353138Z E       ^
2025-05-07T20:32:58.9353488Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9354003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9354113Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) [same test body; via torch/_dynamo/eval_frame.py:678, same CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
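The T=1, D=7168, scale_ub=None example above is the instructive one: fn() itself returns, and the failure moves into ref_fn, because triton_quantize_fp8_row launches its own Triton kernel (_kernel_quantize_fp8_row) and trips the same fp8e4nv restriction while autotuning. A pure-PyTorch sketch of a row-wise fp8 reference that avoids Triton entirely; the semantics (per-row absmax scale, optional scale_ub clamp, dequantize as y_fp8.float() * scale[:, None]) are inferred from the test's own dequantization step, not taken from FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Row-wise absmax scaling into float8_e4m3fn (the dtype Triton calls fp8e4nv).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = (row_max / fp8_max).clamp(min=1e-12)  # one dequant scale per row
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize as y_fp8.to(torch.float32) * scale[:, None]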
2025-05-07T20:32:58.9367488Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) [same test body and CompilationError]
2025-05-07T20:32:58.9380569Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) [same test body and CompilationError]
2025-05-07T20:32:58.9395077Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) [same test body and CompilationError]
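The remaining examples below vary only the sampled parameters and fail identically before the kernel runs. For what the contiguous flag changes about the inputs: x0 = x[:, :D] and x1 = x[:, D:] are strided views into one [T, 2*D] buffer, so with contiguous=False the kernel must read rows with a stride of 2*D elements. A small self-contained illustration (plain PyTorch, no GPU needed; shapes chosen to match one of the sampled examples):

    import torch

    T, D = 128, 5120
    x = torch.randn(T, 2 * D, dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]  # views sharing x's storage; row stride is 2*D
    assert not x0.is_contiguous() and not x1.is_contiguous()
    # The test's reference math in fp32: SiLU(x0) * x1.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    assert y.shape == (T, D)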
2025-05-07T20:32:58.9408615Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) [same test body and CompilationError]
2025-05-07T20:32:58.9421647Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) [same test body and traceback, ending in:]
2025-05-07T20:32:58.9444884Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9444997Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:58.9445152Z E       ^
2025-05-07T20:32:58.9445585Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9446100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9446111Z
Hypothesis then tries eleven more examples. Every one fails with the identical traceback (moe/activation_test.py:117 -> moe/activation_test.py:115 in fn -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant -> triton/runtime/jit.py:623 in run -> triton/compiler/compiler.py:273 in compile, with torch/_dynamo/eval_frame.py:678 additionally on the stack when compiled=True) and the identical error, triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the sampled parameters differ; a capability-gate sketch follows the list:
2025-05-07T20:32:58.9446230Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:58.9481644Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:58.9495698Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:58.9508807Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:58.9522160Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:58.9534899Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:58.9547908Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:58.9561370Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:58.9574202Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:58.9596596Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:58.9610075Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9622603Z 2025-05-07T20:32:58.9623021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9623026Z 2025-05-07T20:32:58.9623136Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9623356Z self=, 2025-05-07T20:32:58.9623444Z T=16384, 2025-05-07T20:32:58.9623522Z D=5120, 2025-05-07T20:32:58.9623606Z scale_ub=1200.0, 2025-05-07T20:32:58.9623698Z contiguous=True, 2025-05-07T20:32:58.9623783Z compiled=True, 2025-05-07T20:32:58.9623856Z ) 2025-05-07T20:32:58.9624075Z self = 2025-05-07T20:32:58.9624248Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:58.9624253Z 2025-05-07T20:32:58.9624331Z @given( 2025-05-07T20:32:58.9624496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9624595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9624719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9624837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9624951Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9625034Z ) 2025-05-07T20:32:58.9625279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9625375Z def test_silu_mul_quant( 2025-05-07T20:32:58.9625459Z self, 2025-05-07T20:32:58.9625541Z T: int, 2025-05-07T20:32:58.9625615Z D: int, 2025-05-07T20:32:58.9625722Z scale_ub: Optional[float], 2025-05-07T20:32:58.9625812Z contiguous: bool, 2025-05-07T20:32:58.9625897Z compiled: bool, 2025-05-07T20:32:58.9625982Z ) -> None: 2025-05-07T20:32:58.9626077Z torch.manual_seed(2025) 2025-05-07T20:32:58.9626150Z 2025-05-07T20:32:58.9626333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9626406Z 2025-05-07T20:32:58.9626509Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9626632Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9626720Z x = x_sign * x_clamp 2025-05-07T20:32:58.9626806Z x0 = x[:, :D] 2025-05-07T20:32:58.9626884Z x1 = x[:, D:] 2025-05-07T20:32:58.9626955Z 2025-05-07T20:32:58.9627093Z if contiguous: 2025-05-07T20:32:58.9627182Z x0 = x0.contiguous() 2025-05-07T20:32:58.9627310Z x1 = x1.contiguous() 2025-05-07T20:32:58.9627388Z 2025-05-07T20:32:58.9627480Z if scale_ub is not None: 2025-05-07T20:32:58.9627588Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9627729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9627809Z ) 2025-05-07T20:32:58.9627892Z else: 2025-05-07T20:32:58.9627989Z scale_ub_tensor = None 2025-05-07T20:32:58.9628063Z 2025-05-07T20:32:58.9628198Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9628289Z op = silu_mul_quant 2025-05-07T20:32:58.9628375Z if compiled: 2025-05-07T20:32:58.9628482Z op = torch.compile(op) 2025-05-07T20:32:58.9628588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9628661Z 2025-05-07T20:32:58.9628757Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9628765Z 2025-05-07T20:32:58.9628861Z moe/activation_test.py:117: 2025-05-07T20:32:58.9628997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9629098Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9629197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9629606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9629701Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9630205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9630307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9630661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9630890Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9631237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9631334Z kernel = self.compile( 2025-05-07T20:32:58.9631722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9631899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9632026Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9632082Z 2025-05-07T20:32:58.9632288Z self = 2025-05-07T20:32:58.9633055Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9633565Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31334d6830>} 2025-05-07T20:32:58.9634307Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9634501Z context = 2025-05-07T20:32:58.9634508Z 2025-05-07T20:32:58.9634676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9634944Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9635057Z module_map=module_map) 2025-05-07T20:32:58.9635223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9635318Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9635400Z E ^ 2025-05-07T20:32:58.9635795Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9635838Z 2025-05-07T20:32:58.9636256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9636261Z 2025-05-07T20:32:58.9636365Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9636586Z self=, 2025-05-07T20:32:58.9636673Z T=16384, 2025-05-07T20:32:58.9636749Z D=5120, 2025-05-07T20:32:58.9636836Z scale_ub=None, 2025-05-07T20:32:58.9636922Z contiguous=False, 2025-05-07T20:32:58.9637001Z compiled=True, 2025-05-07T20:32:58.9637078Z ) 2025-05-07T20:32:58.9637294Z self = 2025-05-07T20:32:58.9637471Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.9637476Z 2025-05-07T20:32:58.9637563Z @given( 2025-05-07T20:32:58.9637680Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9637784Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9637907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9638025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9638148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9638221Z ) 2025-05-07T20:32:58.9638507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9638615Z def test_silu_mul_quant( 2025-05-07T20:32:58.9638689Z self, 2025-05-07T20:32:58.9638763Z T: int, 2025-05-07T20:32:58.9638850Z D: int, 2025-05-07T20:32:58.9638948Z scale_ub: Optional[float], 2025-05-07T20:32:58.9639036Z contiguous: bool, 2025-05-07T20:32:58.9639126Z compiled: bool, 2025-05-07T20:32:58.9639205Z ) -> None: 2025-05-07T20:32:58.9639303Z torch.manual_seed(2025) 2025-05-07T20:32:58.9639384Z 2025-05-07T20:32:58.9639554Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9639630Z 2025-05-07T20:32:58.9639735Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9639859Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9639955Z x = x_sign * x_clamp 2025-05-07T20:32:58.9640036Z x0 = x[:, :D] 2025-05-07T20:32:58.9640119Z x1 = x[:, D:] 2025-05-07T20:32:58.9640199Z 2025-05-07T20:32:58.9640328Z if contiguous: 2025-05-07T20:32:58.9640420Z x0 = x0.contiguous() 2025-05-07T20:32:58.9640517Z x1 = x1.contiguous() 2025-05-07T20:32:58.9640591Z 2025-05-07T20:32:58.9640681Z if scale_ub is not None: 2025-05-07T20:32:58.9640795Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9640930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9641008Z ) 2025-05-07T20:32:58.9641097Z else: 2025-05-07T20:32:58.9641194Z scale_ub_tensor = None 2025-05-07T20:32:58.9641273Z 2025-05-07T20:32:58.9641404Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9641492Z op = silu_mul_quant 2025-05-07T20:32:58.9641580Z if compiled: 2025-05-07T20:32:58.9641679Z op = torch.compile(op) 2025-05-07T20:32:58.9641783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9641863Z 2025-05-07T20:32:58.9641952Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9641960Z 2025-05-07T20:32:58.9642061Z moe/activation_test.py:117: 2025-05-07T20:32:58.9642194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9642294Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9642399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9642760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9642898Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9643434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9643533Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9643887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9644120Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9644467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9644568Z kernel = self.compile( 2025-05-07T20:32:58.9644946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9645120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9645256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9645261Z 2025-05-07T20:32:58.9645470Z self = 2025-05-07T20:32:58.9646302Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9646802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31334d7760>} 2025-05-07T20:32:58.9647541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9647735Z context = 2025-05-07T20:32:58.9647743Z 2025-05-07T20:32:58.9647907Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9648179Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9648285Z module_map=module_map) 2025-05-07T20:32:58.9648445Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9648550Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9648628Z E ^ 2025-05-07T20:32:58.9649023Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9649035Z 2025-05-07T20:32:58.9649450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9649454Z 2025-05-07T20:32:58.9649559Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9649784Z self=, 2025-05-07T20:32:58.9649861Z T=2048, 2025-05-07T20:32:58.9649933Z D=5120, 2025-05-07T20:32:58.9650020Z scale_ub=None, 2025-05-07T20:32:58.9650109Z contiguous=False, 2025-05-07T20:32:58.9650190Z compiled=True, 2025-05-07T20:32:58.9650268Z ) 2025-05-07T20:32:58.9650481Z self = 2025-05-07T20:32:58.9650661Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.9650666Z 2025-05-07T20:32:58.9650743Z @given( 2025-05-07T20:32:58.9650860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9650982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9651097Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9651212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9651331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9651406Z ) 2025-05-07T20:32:58.9651697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9651796Z def test_silu_mul_quant( 2025-05-07T20:32:58.9651910Z self, 2025-05-07T20:32:58.9651995Z T: int, 2025-05-07T20:32:58.9652070Z D: int, 2025-05-07T20:32:58.9652169Z scale_ub: Optional[float], 2025-05-07T20:32:58.9652266Z contiguous: bool, 2025-05-07T20:32:58.9652351Z compiled: bool, 2025-05-07T20:32:58.9652429Z ) -> None: 2025-05-07T20:32:58.9652531Z torch.manual_seed(2025) 2025-05-07T20:32:58.9652603Z 2025-05-07T20:32:58.9652771Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9652849Z 2025-05-07T20:32:58.9652939Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9653061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9653156Z x = x_sign * x_clamp 2025-05-07T20:32:58.9653236Z x0 = x[:, :D] 2025-05-07T20:32:58.9653327Z x1 = x[:, D:] 2025-05-07T20:32:58.9653401Z 2025-05-07T20:32:58.9653486Z if contiguous: 2025-05-07T20:32:58.9653587Z x0 = x0.contiguous() 2025-05-07T20:32:58.9653677Z x1 = x1.contiguous() 2025-05-07T20:32:58.9653750Z 2025-05-07T20:32:58.9653848Z if scale_ub is not None: 2025-05-07T20:32:58.9653954Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9654144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9654230Z ) 2025-05-07T20:32:58.9654309Z else: 2025-05-07T20:32:58.9654402Z scale_ub_tensor = None 2025-05-07T20:32:58.9654481Z 2025-05-07T20:32:58.9654613Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9654702Z op = silu_mul_quant 2025-05-07T20:32:58.9654794Z if compiled: 2025-05-07T20:32:58.9654892Z op = torch.compile(op) 2025-05-07T20:32:58.9655004Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9655081Z 2025-05-07T20:32:58.9655171Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9655175Z 2025-05-07T20:32:58.9655281Z moe/activation_test.py:117: 2025-05-07T20:32:58.9655408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9655508Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9655610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9655977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9656145Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9656632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9656731Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9657090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9657309Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9657655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9657755Z kernel = self.compile( 2025-05-07T20:32:58.9658136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9658316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9658444Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9658448Z 2025-05-07T20:32:58.9658651Z self = 2025-05-07T20:32:58.9659422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9660112Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31333783a0>} 2025-05-07T20:32:58.9660875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9661066Z context = 2025-05-07T20:32:58.9661073Z 2025-05-07T20:32:58.9661242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9661508Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9661615Z module_map=module_map) 2025-05-07T20:32:58.9661780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9661877Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9661958Z E ^ 2025-05-07T20:32:58.9662323Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9662328Z 2025-05-07T20:32:58.9662742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9662747Z 2025-05-07T20:32:58.9662861Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9663125Z self=, 2025-05-07T20:32:58.9663207Z T=2048, 2025-05-07T20:32:58.9663290Z D=5120, 2025-05-07T20:32:58.9663374Z scale_ub=1200.0, 2025-05-07T20:32:58.9663461Z contiguous=False, 2025-05-07T20:32:58.9663551Z compiled=True, 2025-05-07T20:32:58.9663624Z ) 2025-05-07T20:32:58.9663839Z self = 2025-05-07T20:32:58.9664021Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:58.9664029Z 2025-05-07T20:32:58.9664107Z @given( 2025-05-07T20:32:58.9664236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9664337Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9664455Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9664581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9664701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9664776Z ) 2025-05-07T20:32:58.9665071Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9665166Z def test_silu_mul_quant( 2025-05-07T20:32:58.9665244Z self, 2025-05-07T20:32:58.9665326Z T: int, 2025-05-07T20:32:58.9665402Z D: int, 2025-05-07T20:32:58.9665509Z scale_ub: Optional[float], 2025-05-07T20:32:58.9665600Z contiguous: bool, 2025-05-07T20:32:58.9665686Z compiled: bool, 2025-05-07T20:32:58.9665774Z ) -> None: 2025-05-07T20:32:58.9665872Z torch.manual_seed(2025) 2025-05-07T20:32:58.9665949Z 2025-05-07T20:32:58.9666126Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9666203Z 2025-05-07T20:32:58.9666296Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9666427Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9666518Z x = x_sign * x_clamp 2025-05-07T20:32:58.9666603Z x0 = x[:, :D] 2025-05-07T20:32:58.9666695Z x1 = x[:, D:] 2025-05-07T20:32:58.9666768Z 2025-05-07T20:32:58.9666859Z if contiguous: 2025-05-07T20:32:58.9666952Z x0 = x0.contiguous() 2025-05-07T20:32:58.9667042Z x1 = x1.contiguous() 2025-05-07T20:32:58.9667122Z 2025-05-07T20:32:58.9667213Z if scale_ub is not None: 2025-05-07T20:32:58.9667319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9667461Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9667584Z ) 2025-05-07T20:32:58.9667663Z else: 2025-05-07T20:32:58.9667805Z scale_ub_tensor = None 2025-05-07T20:32:58.9667879Z 2025-05-07T20:32:58.9668011Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9668109Z op = silu_mul_quant 2025-05-07T20:32:58.9668196Z if compiled: 2025-05-07T20:32:58.9668298Z op = torch.compile(op) 2025-05-07T20:32:58.9668415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9668491Z 2025-05-07T20:32:58.9668589Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9668593Z 2025-05-07T20:32:58.9668692Z moe/activation_test.py:117: 2025-05-07T20:32:58.9668820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9668930Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9669028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9669393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9669498Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9669999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9670105Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9670503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9670735Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9671086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9671181Z kernel = self.compile( 2025-05-07T20:32:58.9671560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9671742Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9671872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9671879Z 2025-05-07T20:32:58.9672092Z self = 2025-05-07T20:32:58.9672867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9673420Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3133378820>} 2025-05-07T20:32:58.9674158Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9674354Z context = 2025-05-07T20:32:58.9674358Z 2025-05-07T20:32:58.9674531Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9674796Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9674909Z module_map=module_map) 2025-05-07T20:32:58.9675075Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9675175Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9675262Z E ^ 2025-05-07T20:32:58.9675614Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9675619Z 2025-05-07T20:32:58.9676035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9676046Z 2025-05-07T20:32:58.9676151Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9676417Z self=, 2025-05-07T20:32:58.9676500Z T=4096, 2025-05-07T20:32:58.9676639Z D=5120, 2025-05-07T20:32:58.9676725Z scale_ub=1200.0, 2025-05-07T20:32:58.9676818Z contiguous=True, 2025-05-07T20:32:58.9676901Z compiled=True, 2025-05-07T20:32:58.9676974Z ) 2025-05-07T20:32:58.9677196Z self = 2025-05-07T20:32:58.9677371Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:58.9677379Z 2025-05-07T20:32:58.9677463Z @given( 2025-05-07T20:32:58.9677583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9677683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9677802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9677918Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9678032Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9678117Z ) 2025-05-07T20:32:58.9678370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9678464Z def test_silu_mul_quant( 2025-05-07T20:32:58.9678547Z self, 2025-05-07T20:32:58.9678624Z T: int, 2025-05-07T20:32:58.9678701Z D: int, 2025-05-07T20:32:58.9678807Z scale_ub: Optional[float], 2025-05-07T20:32:58.9678897Z contiguous: bool, 2025-05-07T20:32:58.9679032Z compiled: bool, 2025-05-07T20:32:58.9679116Z ) -> None: 2025-05-07T20:32:58.9679211Z torch.manual_seed(2025) 2025-05-07T20:32:58.9679291Z 2025-05-07T20:32:58.9679462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9679540Z 2025-05-07T20:32:58.9679640Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9679765Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9679854Z x = x_sign * x_clamp 2025-05-07T20:32:58.9679943Z x0 = x[:, :D] 2025-05-07T20:32:58.9680025Z x1 = x[:, D:] 2025-05-07T20:32:58.9680100Z 2025-05-07T20:32:58.9680194Z if contiguous: 2025-05-07T20:32:58.9680287Z x0 = x0.contiguous() 2025-05-07T20:32:58.9680383Z x1 = x1.contiguous() 2025-05-07T20:32:58.9680454Z 2025-05-07T20:32:58.9680550Z if scale_ub is not None: 2025-05-07T20:32:58.9680664Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9680801Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9680925Z ) 2025-05-07T20:32:58.9681011Z else: 2025-05-07T20:32:58.9681106Z scale_ub_tensor = None 2025-05-07T20:32:58.9681181Z 2025-05-07T20:32:58.9681318Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9681410Z op = silu_mul_quant 2025-05-07T20:32:58.9681497Z if compiled: 2025-05-07T20:32:58.9681605Z op = torch.compile(op) 2025-05-07T20:32:58.9681717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9681790Z 2025-05-07T20:32:58.9681888Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9681897Z 2025-05-07T20:32:58.9681998Z moe/activation_test.py:117: 2025-05-07T20:32:58.9682132Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9682233Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9682331Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9682707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9682807Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9683297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9683403Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9683759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9684037Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9684417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9684513Z kernel = self.compile( 2025-05-07T20:32:58.9684904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9685081Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9685216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9685221Z 2025-05-07T20:32:58.9685424Z self = 2025-05-07T20:32:58.9686192Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9686698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3133379360>} 2025-05-07T20:32:58.9687477Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9687676Z context = 2025-05-07T20:32:58.9687680Z 2025-05-07T20:32:58.9687848Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9688140Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9688266Z module_map=module_map) 2025-05-07T20:32:58.9688445Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9688555Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9688632Z E ^ 2025-05-07T20:32:58.9688988Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9688993Z 2025-05-07T20:32:58.9689418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9689426Z 2025-05-07T20:32:58.9689530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9690572Z self=, 2025-05-07T20:32:58.9690655Z T=128, 2025-05-07T20:32:58.9690733Z D=5120, 2025-05-07T20:32:58.9690824Z scale_ub=1200.0, 2025-05-07T20:32:58.9690912Z contiguous=False, 2025-05-07T20:32:58.9690996Z compiled=True, 2025-05-07T20:32:58.9691079Z ) 2025-05-07T20:32:58.9691293Z self = 2025-05-07T20:32:58.9691470Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:58.9691475Z 2025-05-07T20:32:58.9691562Z @given( 2025-05-07T20:32:58.9691685Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9691790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9691906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9692029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9692149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9692234Z ) 2025-05-07T20:32:58.9692483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9692585Z def test_silu_mul_quant( 2025-05-07T20:32:58.9692663Z self, 2025-05-07T20:32:58.9692739Z T: int, 2025-05-07T20:32:58.9692826Z D: int, 2025-05-07T20:32:58.9692927Z scale_ub: Optional[float], 2025-05-07T20:32:58.9693018Z contiguous: bool, 2025-05-07T20:32:58.9693208Z compiled: bool, 2025-05-07T20:32:58.9693289Z ) -> None: 2025-05-07T20:32:58.9693448Z torch.manual_seed(2025) 2025-05-07T20:32:58.9693524Z 2025-05-07T20:32:58.9693694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9693778Z 2025-05-07T20:32:58.9693873Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9693999Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9694097Z x = x_sign * x_clamp 2025-05-07T20:32:58.9694185Z x0 = x[:, :D] 2025-05-07T20:32:58.9694266Z x1 = x[:, D:] 2025-05-07T20:32:58.9694348Z 2025-05-07T20:32:58.9694434Z if contiguous: 2025-05-07T20:32:58.9694527Z x0 = x0.contiguous() 2025-05-07T20:32:58.9694622Z x1 = x1.contiguous() 2025-05-07T20:32:58.9694695Z 2025-05-07T20:32:58.9694787Z if scale_ub is not None: 2025-05-07T20:32:58.9694901Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9695039Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9695123Z ) 2025-05-07T20:32:58.9695205Z else: 2025-05-07T20:32:58.9695301Z scale_ub_tensor = None 2025-05-07T20:32:58.9695381Z 2025-05-07T20:32:58.9695513Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9695607Z op = silu_mul_quant 2025-05-07T20:32:58.9695700Z if compiled: 2025-05-07T20:32:58.9695867Z op = torch.compile(op) 2025-05-07T20:32:58.9695980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9696063Z 2025-05-07T20:32:58.9696156Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9696160Z 2025-05-07T20:32:58.9696269Z moe/activation_test.py:117: 2025-05-07T20:32:58.9696399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9696501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9696609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9696984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9697080Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9697586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9697685Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9698050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9698342Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9698686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9698786Z kernel = self.compile( 2025-05-07T20:32:58.9699173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9699350Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9699487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9699492Z 2025-05-07T20:32:58.9699696Z self = 2025-05-07T20:32:58.9700591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9701099Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f313337a290>} 2025-05-07T20:32:58.9701845Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9702083Z context = 2025-05-07T20:32:58.9702125Z 2025-05-07T20:32:58.9702294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9702570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9702677Z module_map=module_map) 2025-05-07T20:32:58.9702841Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9702952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9703030Z E ^ 2025-05-07T20:32:58.9703390Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9703395Z 2025-05-07T20:32:58.9703806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9703813Z 2025-05-07T20:32:58.9703917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9704150Z self=, 2025-05-07T20:32:58.9704230Z T=16384, 2025-05-07T20:32:58.9704312Z D=7168, 2025-05-07T20:32:58.9704396Z scale_ub=1200.0, 2025-05-07T20:32:58.9704481Z contiguous=True, 2025-05-07T20:32:58.9704570Z compiled=True, 2025-05-07T20:32:58.9704645Z ) 2025-05-07T20:32:58.9704903Z self = 2025-05-07T20:32:58.9705090Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:58.9705095Z 2025-05-07T20:32:58.9705173Z @given( 2025-05-07T20:32:58.9705291Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9705395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9705514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9705637Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9705756Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9705831Z ) 2025-05-07T20:32:58.9706087Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9706183Z def test_silu_mul_quant( 2025-05-07T20:32:58.9706262Z self, 2025-05-07T20:32:58.9706345Z T: int, 2025-05-07T20:32:58.9706422Z D: int, 2025-05-07T20:32:58.9706524Z scale_ub: Optional[float], 2025-05-07T20:32:58.9706620Z contiguous: bool, 2025-05-07T20:32:58.9706752Z compiled: bool, 2025-05-07T20:32:58.9706832Z ) -> None: 2025-05-07T20:32:58.9706932Z torch.manual_seed(2025) 2025-05-07T20:32:58.9707006Z 2025-05-07T20:32:58.9707181Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9707256Z 2025-05-07T20:32:58.9707347Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9707479Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9707572Z x = x_sign * x_clamp 2025-05-07T20:32:58.9707654Z x0 = x[:, :D] 2025-05-07T20:32:58.9707744Z x1 = x[:, D:] 2025-05-07T20:32:58.9707818Z 2025-05-07T20:32:58.9707903Z if contiguous: 2025-05-07T20:32:58.9708003Z x0 = x0.contiguous() 2025-05-07T20:32:58.9708094Z x1 = x1.contiguous() 2025-05-07T20:32:58.9708166Z 2025-05-07T20:32:58.9708263Z if scale_ub is not None: 2025-05-07T20:32:58.9708373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9708511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9708595Z ) 2025-05-07T20:32:58.9708674Z else: 2025-05-07T20:32:58.9708775Z scale_ub_tensor = None 2025-05-07T20:32:58.9708848Z 2025-05-07T20:32:58.9708977Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9709073Z op = silu_mul_quant 2025-05-07T20:32:58.9709160Z if compiled: 2025-05-07T20:32:58.9709332Z op = torch.compile(op) 2025-05-07T20:32:58.9709445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9709561Z 2025-05-07T20:32:58.9709656Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9709660Z 2025-05-07T20:32:58.9709766Z moe/activation_test.py:117: 2025-05-07T20:32:58.9709895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9710005Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9710109Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9714689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9714801Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9715311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9715413Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9715787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9716025Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9716376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9716476Z kernel = self.compile( 2025-05-07T20:32:58.9716940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9717132Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9717274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9717279Z 2025-05-07T20:32:58.9717490Z self = 2025-05-07T20:32:58.9718315Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9718828Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f313337ad40>} 2025-05-07T20:32:58.9719577Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9719818Z context = 2025-05-07T20:32:58.9719823Z 2025-05-07T20:32:58.9719995Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9720269Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9720380Z module_map=module_map) 2025-05-07T20:32:58.9720549Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9720656Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9720733Z E ^ 2025-05-07T20:32:58.9721087Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9721092Z 2025-05-07T20:32:58.9721511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9721518Z 2025-05-07T20:32:58.9721620Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9721844Z self=, 2025-05-07T20:32:58.9721923Z T=16384, 2025-05-07T20:32:58.9721999Z D=5120, 2025-05-07T20:32:58.9722084Z scale_ub=1200.0, 2025-05-07T20:32:58.9722167Z contiguous=True, 2025-05-07T20:32:58.9722246Z compiled=False, 2025-05-07T20:32:58.9722324Z ) 2025-05-07T20:32:58.9722584Z self = 2025-05-07T20:32:58.9722802Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.9722814Z 2025-05-07T20:32:58.9722892Z @given( 2025-05-07T20:32:58.9723011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9723116Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9723233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9723351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9723477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9723549Z ) 2025-05-07T20:32:58.9723796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9723896Z def test_silu_mul_quant( 2025-05-07T20:32:58.9723972Z self, 2025-05-07T20:32:58.9724049Z T: int, 2025-05-07T20:32:58.9724129Z D: int, 2025-05-07T20:32:58.9724224Z scale_ub: Optional[float], 2025-05-07T20:32:58.9724325Z contiguous: bool, 2025-05-07T20:32:58.9724409Z compiled: bool, 2025-05-07T20:32:58.9724488Z ) -> None: 2025-05-07T20:32:58.9724588Z torch.manual_seed(2025) 2025-05-07T20:32:58.9724661Z 2025-05-07T20:32:58.9724828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9724909Z 2025-05-07T20:32:58.9725000Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9725165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9725264Z x = x_sign * x_clamp 2025-05-07T20:32:58.9725344Z x0 = x[:, :D] 2025-05-07T20:32:58.9725420Z x1 = x[:, D:] 2025-05-07T20:32:58.9725499Z 2025-05-07T20:32:58.9725583Z if contiguous: 2025-05-07T20:32:58.9725674Z x0 = x0.contiguous() 2025-05-07T20:32:58.9725768Z x1 = x1.contiguous() 2025-05-07T20:32:58.9725841Z 2025-05-07T20:32:58.9725934Z if scale_ub is not None: 2025-05-07T20:32:58.9726045Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9726184Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9726270Z ) 2025-05-07T20:32:58.9726348Z else: 2025-05-07T20:32:58.9726443Z scale_ub_tensor = None 2025-05-07T20:32:58.9726518Z 2025-05-07T20:32:58.9726647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9726742Z op = silu_mul_quant 2025-05-07T20:32:58.9726828Z if compiled: 2025-05-07T20:32:58.9726970Z op = torch.compile(op) 2025-05-07T20:32:58.9727077Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9727154Z 2025-05-07T20:32:58.9727245Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9727250Z 2025-05-07T20:32:58.9727352Z moe/activation_test.py:117: 2025-05-07T20:32:58.9727478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9727576Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9727679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9728181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:58.9728279Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9728638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9728859Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9729208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9729300Z kernel = self.compile( 2025-05-07T20:32:58.9729681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9729859Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9730033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9730038Z 2025-05-07T20:32:58.9730285Z self = 2025-05-07T20:32:58.9731071Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9731563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f313337bac0>} 2025-05-07T20:32:58.9732308Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9732499Z context = 2025-05-07T20:32:58.9732507Z 2025-05-07T20:32:58.9732681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9732943Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9733048Z module_map=module_map) 2025-05-07T20:32:58.9733219Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9733358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9733444Z E ^ 2025-05-07T20:32:58.9733806Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9733811Z 2025-05-07T20:32:58.9734220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9734225Z 2025-05-07T20:32:58.9734334Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9734554Z self=, 2025-05-07T20:32:58.9734633Z T=1, 2025-05-07T20:32:58.9734716Z D=7168, 2025-05-07T20:32:58.9734800Z scale_ub=1200.0, 2025-05-07T20:32:58.9734888Z contiguous=False, 2025-05-07T20:32:58.9734972Z compiled=False, 2025-05-07T20:32:58.9735043Z ) 2025-05-07T20:32:58.9735263Z self = 2025-05-07T20:32:58.9735437Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:58.9735482Z 2025-05-07T20:32:58.9735555Z @given( 2025-05-07T20:32:58.9735681Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9735778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9735894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9736016Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9736133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9736214Z ) 2025-05-07T20:32:58.9736464Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9736561Z def test_silu_mul_quant( 2025-05-07T20:32:58.9736640Z self, 2025-05-07T20:32:58.9736713Z T: int, 2025-05-07T20:32:58.9736786Z D: int, 2025-05-07T20:32:58.9736890Z scale_ub: Optional[float], 2025-05-07T20:32:58.9736980Z contiguous: bool, 2025-05-07T20:32:58.9737062Z compiled: bool, 2025-05-07T20:32:58.9737145Z ) -> None: 2025-05-07T20:32:58.9737244Z torch.manual_seed(2025) 2025-05-07T20:32:58.9737313Z 2025-05-07T20:32:58.9737482Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9737556Z 2025-05-07T20:32:58.9737656Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9737778Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9737870Z x = x_sign * x_clamp 2025-05-07T20:32:58.9737946Z x0 = x[:, :D] 2025-05-07T20:32:58.9738079Z x1 = x[:, D:] 2025-05-07T20:32:58.9738149Z 2025-05-07T20:32:58.9738235Z if contiguous: 2025-05-07T20:32:58.9738368Z x0 = x0.contiguous() 2025-05-07T20:32:58.9738456Z x1 = x1.contiguous() 2025-05-07T20:32:58.9738524Z 2025-05-07T20:32:58.9738617Z if scale_ub is not None: 2025-05-07T20:32:58.9738721Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9738857Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9738938Z ) 2025-05-07T20:32:58.9739016Z else: 2025-05-07T20:32:58.9739108Z scale_ub_tensor = None 2025-05-07T20:32:58.9739183Z 2025-05-07T20:32:58.9739312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9739398Z op = silu_mul_quant 2025-05-07T20:32:58.9739486Z if compiled: 2025-05-07T20:32:58.9739586Z op = torch.compile(op) 2025-05-07T20:32:58.9739696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9739868Z 2025-05-07T20:32:58.9739960Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9739965Z 2025-05-07T20:32:58.9740068Z moe/activation_test.py:117: 2025-05-07T20:32:58.9740194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9740295Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9740395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9740934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9741041Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9741398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9741617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9741960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9742057Z kernel = self.compile( 2025-05-07T20:32:58.9742446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9742627Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9742749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9742754Z 2025-05-07T20:32:58.9742963Z self = 2025-05-07T20:32:58.9743816Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9744316Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132f9c4c0>} 2025-05-07T20:32:58.9745059Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9745244Z context = 2025-05-07T20:32:58.9745249Z 2025-05-07T20:32:58.9745421Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9745684Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9745794Z module_map=module_map) 2025-05-07T20:32:58.9745955Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9746049Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9746129Z E ^ 2025-05-07T20:32:58.9746478Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9746527Z 2025-05-07T20:32:58.9746980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9746985Z 2025-05-07T20:32:58.9747093Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9747312Z self=, 2025-05-07T20:32:58.9747387Z T=4096, 2025-05-07T20:32:58.9747460Z D=7168, 2025-05-07T20:32:58.9747545Z scale_ub=1200.0, 2025-05-07T20:32:58.9747638Z contiguous=False, 2025-05-07T20:32:58.9747716Z compiled=True, 2025-05-07T20:32:58.9747786Z ) 2025-05-07T20:32:58.9748031Z self = 2025-05-07T20:32:58.9748229Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:58.9748234Z 2025-05-07T20:32:58.9748304Z @given( 2025-05-07T20:32:58.9748428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9748532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9748652Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9748768Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9748879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9748955Z ) 2025-05-07T20:32:58.9749197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9749332Z def test_silu_mul_quant( 2025-05-07T20:32:58.9749416Z self, 2025-05-07T20:32:58.9749487Z T: int, 2025-05-07T20:32:58.9749558Z D: int, 2025-05-07T20:32:58.9749657Z scale_ub: Optional[float], 2025-05-07T20:32:58.9749744Z contiguous: bool, 2025-05-07T20:32:58.9749825Z compiled: bool, 2025-05-07T20:32:58.9749908Z ) -> None: 2025-05-07T20:32:58.9750000Z torch.manual_seed(2025) 2025-05-07T20:32:58.9750070Z 2025-05-07T20:32:58.9750237Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9750312Z 2025-05-07T20:32:58.9750404Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9750530Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9750617Z x = x_sign * x_clamp 2025-05-07T20:32:58.9750695Z x0 = x[:, :D] 2025-05-07T20:32:58.9750771Z x1 = x[:, D:] 2025-05-07T20:32:58.9750842Z 2025-05-07T20:32:58.9750930Z if contiguous: 2025-05-07T20:32:58.9751027Z x0 = x0.contiguous() 2025-05-07T20:32:58.9751157Z x1 = x1.contiguous() 2025-05-07T20:32:58.9751233Z 2025-05-07T20:32:58.9751319Z if scale_ub is not None: 2025-05-07T20:32:58.9751428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9751562Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9751635Z ) 2025-05-07T20:32:58.9751714Z else: 2025-05-07T20:32:58.9751804Z scale_ub_tensor = None 2025-05-07T20:32:58.9751878Z 2025-05-07T20:32:58.9752008Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9752095Z op = silu_mul_quant 2025-05-07T20:32:58.9752182Z if compiled: 2025-05-07T20:32:58.9752286Z op = torch.compile(op) 2025-05-07T20:32:58.9752389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9752461Z 2025-05-07T20:32:58.9752558Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9752563Z 2025-05-07T20:32:58.9752662Z moe/activation_test.py:117: 2025-05-07T20:32:58.9752793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9752894Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9752992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9753361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9753453Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9753950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9754133Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9754493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9754719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9755061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9755158Z kernel = self.compile( 2025-05-07T20:32:58.9755548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9755721Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9755846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9755857Z 2025-05-07T20:32:58.9756061Z self = 2025-05-07T20:32:58.9756828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9757371Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132f9d1b0>} 2025-05-07T20:32:58.9758122Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9758316Z context = 2025-05-07T20:32:58.9758320Z 2025-05-07T20:32:58.9758484Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9758749Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9758863Z module_map=module_map) 2025-05-07T20:32:58.9759020Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9759120Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9759196Z E ^ 2025-05-07T20:32:58.9759548Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9759592Z 2025-05-07T20:32:58.9760011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9760015Z 2025-05-07T20:32:58.9760115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9760331Z self=, 2025-05-07T20:32:58.9760413Z T=128, 2025-05-07T20:32:58.9760488Z D=7168, 2025-05-07T20:32:58.9760567Z scale_ub=1200.0, 2025-05-07T20:32:58.9760648Z contiguous=False, 2025-05-07T20:32:58.9760730Z compiled=True, 2025-05-07T20:32:58.9760802Z ) 2025-05-07T20:32:58.9761012Z self = 2025-05-07T20:32:58.9761180Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:58.9761184Z 2025-05-07T20:32:58.9761264Z @given( 2025-05-07T20:32:58.9761382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9761479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9761599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9761713Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9761827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9761899Z ) 2025-05-07T20:32:58.9762146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9762285Z def test_silu_mul_quant( 2025-05-07T20:32:58.9762355Z self, 2025-05-07T20:32:58.9762427Z T: int, 2025-05-07T20:32:58.9762545Z D: int, 2025-05-07T20:32:58.9762647Z scale_ub: Optional[float], 2025-05-07T20:32:58.9762737Z contiguous: bool, 2025-05-07T20:32:58.9762823Z compiled: bool, 2025-05-07T20:32:58.9762900Z ) -> None: 2025-05-07T20:32:58.9762990Z torch.manual_seed(2025) 2025-05-07T20:32:58.9763063Z 2025-05-07T20:32:58.9763231Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9763312Z 2025-05-07T20:32:58.9763402Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9763525Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9763615Z x = x_sign * x_clamp 2025-05-07T20:32:58.9763693Z x0 = x[:, :D] 2025-05-07T20:32:58.9763772Z x1 = x[:, D:] 2025-05-07T20:32:58.9763842Z 2025-05-07T20:32:58.9763923Z if contiguous: 2025-05-07T20:32:58.9764016Z x0 = x0.contiguous() 2025-05-07T20:32:58.9764108Z x1 = x1.contiguous() 2025-05-07T20:32:58.9764177Z 2025-05-07T20:32:58.9764267Z if scale_ub is not None: 2025-05-07T20:32:58.9764375Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9764507Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9764585Z ) 2025-05-07T20:32:58.9764659Z else: 2025-05-07T20:32:58.9764799Z scale_ub_tensor = None 2025-05-07T20:32:58.9764876Z 2025-05-07T20:32:58.9765003Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9765089Z op = silu_mul_quant 2025-05-07T20:32:58.9765178Z if compiled: 2025-05-07T20:32:58.9765276Z op = torch.compile(op) 2025-05-07T20:32:58.9765378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9765448Z 2025-05-07T20:32:58.9765536Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9765544Z 2025-05-07T20:32:58.9765640Z moe/activation_test.py:117: 2025-05-07T20:32:58.9765771Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9765871Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9765976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9766337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9766433Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9766941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9767081Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9767443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9767663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9768000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9768095Z kernel = self.compile( 2025-05-07T20:32:58.9768474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9768650Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9768779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9768783Z 2025-05-07T20:32:58.9768988Z self = 2025-05-07T20:32:58.9769755Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9770247Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132f9c0d0>} 2025-05-07T20:32:58.9771079Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9771274Z context = 2025-05-07T20:32:58.9771278Z 2025-05-07T20:32:58.9771445Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9771717Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9771823Z module_map=module_map) 2025-05-07T20:32:58.9771981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9772083Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9772160Z E ^ 2025-05-07T20:32:58.9772516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9772523Z 2025-05-07T20:32:58.9772946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9772951Z 2025-05-07T20:32:58.9773052Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9773275Z self=, 2025-05-07T20:32:58.9773410Z T=2048, 2025-05-07T20:32:58.9773487Z D=7168, 2025-05-07T20:32:58.9773570Z scale_ub=None, 2025-05-07T20:32:58.9773650Z contiguous=True, 2025-05-07T20:32:58.9773739Z compiled=True, 2025-05-07T20:32:58.9773810Z ) 2025-05-07T20:32:58.9774022Z self = 2025-05-07T20:32:58.9774196Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.9774201Z 2025-05-07T20:32:58.9774272Z @given( 2025-05-07T20:32:58.9774387Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9774486Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9774603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9774717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9774834Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9774908Z ) 2025-05-07T20:32:58.9775155Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9775291Z def test_silu_mul_quant( 2025-05-07T20:32:58.9775366Z self, 2025-05-07T20:32:58.9775444Z T: int, 2025-05-07T20:32:58.9775516Z D: int, 2025-05-07T20:32:58.9775613Z scale_ub: Optional[float], 2025-05-07T20:32:58.9775709Z contiguous: bool, 2025-05-07T20:32:58.9775792Z compiled: bool, 2025-05-07T20:32:58.9775867Z ) -> None: 2025-05-07T20:32:58.9775964Z torch.manual_seed(2025) 2025-05-07T20:32:58.9776038Z 2025-05-07T20:32:58.9776204Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9776281Z 2025-05-07T20:32:58.9776375Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9776501Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9776587Z x = x_sign * x_clamp 2025-05-07T20:32:58.9776665Z x0 = x[:, :D] 2025-05-07T20:32:58.9776750Z x1 = x[:, D:] 2025-05-07T20:32:58.9776822Z 2025-05-07T20:32:58.9776908Z if contiguous: 2025-05-07T20:32:58.9777006Z x0 = x0.contiguous() 2025-05-07T20:32:58.9777092Z x1 = x1.contiguous() 2025-05-07T20:32:58.9777163Z 2025-05-07T20:32:58.9777254Z if scale_ub is not None: 2025-05-07T20:32:58.9777358Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9777488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9777564Z ) 2025-05-07T20:32:58.9777638Z else: 2025-05-07T20:32:58.9777777Z scale_ub_tensor = None 2025-05-07T20:32:58.9777850Z 2025-05-07T20:32:58.9778018Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9778114Z op = silu_mul_quant 2025-05-07T20:32:58.9778196Z if compiled: 2025-05-07T20:32:58.9778294Z op = torch.compile(op) 2025-05-07T20:32:58.9778402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9778472Z 2025-05-07T20:32:58.9778561Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9778568Z 2025-05-07T20:32:58.9778670Z moe/activation_test.py:117: 2025-05-07T20:32:58.9778794Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9778895Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9778996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9779357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9779457Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9780042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9780140Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9780508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9780775Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9781119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9781218Z kernel = self.compile( 2025-05-07T20:32:58.9781595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9781771Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9781894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9781902Z 2025-05-07T20:32:58.9782106Z self = 2025-05-07T20:32:58.9782884Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9783380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132f9e560>} 2025-05-07T20:32:58.9784173Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9784359Z context = 2025-05-07T20:32:58.9784364Z 2025-05-07T20:32:58.9784537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9784802Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9784904Z module_map=module_map) 2025-05-07T20:32:58.9785070Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9785165Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9785241Z E ^ 2025-05-07T20:32:58.9785599Z E ValueError("type fp8e4nv not supported in this architecture. 
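Every CompilationError in this run has the same root cause: Triton only emits the fp8e4nv (OCP float8 e4m3) dtype on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G on this linux.g5.4xlarge runner reports (8, 6), where only fp8e4b15 and fp8e5 are available, which is exactly what the ValueError says. Note that later examples with compiled=False fail identically, so torch.compile is not the trigger; silu_mul_quant launches the Triton kernel in eager mode too. A capability guard could skip these examples cleanly on pre-SM-8.9 parts; this is a minimal sketch, with helper and class names that are illustrative rather than FBGEMM's:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton fp8e4nv codegen needs compute capability >= 8.9; the A10G
        # in this job reports (8, 6), so this helper would return False here.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    class SiluMulQuantTest(unittest.TestCase):  # hypothetical stand-in
        @unittest.skipIf(
            not supports_fp8e4nv(),
            "Triton fp8e4nv requires SM 8.9+ (Ada/Hopper)",
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body as in moe/activation_test.py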
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
self =
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB, 28.44 MiB free

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
-> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB, 140.44 MiB free

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB, 28.44 MiB free

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
-> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 56.00 MiB, 28.44 MiB free
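The OutOfMemoryError examples look like a knock-on effect rather than an independent bug: by this point the process already holds roughly 21.6 GiB of the GPU's 22.07 GiB, so even allocations of a few hundred MiB or less at the very first lines of the test (activation_test.py:92-95) fail. Hypothesis replays all of its examples inside a single test invocation, so setUp/tearDown never run between examples and memory from earlier failed examples accumulates. One mitigation, sketched under that assumption (the helper is ours, not part of the test file), is to reclaim CUDA memory explicitly at the top of each example:

    import gc

    import torch

    def free_cuda_memory() -> None:
        # Hypothesis runs every example within one test call, so per-test
        # fixtures do not fire between examples; release dead tensors and
        # return cached blocks to the driver by hand.
        gc.collect()
        torch.cuda.empty_cache()

    # Inside test_silu_mul_quant, before the first allocation:
    #     free_cuda_memory()
    #     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)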
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
-> CompilationError at triton/compiler/compiler.py:100: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
-> same CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
-> same CompilationError
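The allocator hint repeated in each OutOfMemoryError above and below (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) only addresses fragmentation, not the accumulation itself, and it must be in the environment before the process makes its first CUDA allocation, so for this job it belongs in the workflow's env block rather than inside the test. A sketch of the in-process equivalent, assuming no CUDA work has happened yet at import time:

    import os

    # Must be set before the CUDA caching allocator initializes, i.e. before
    # the first tensor lands on a CUDA device anywhere in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  (imported after setting the env var on purpose)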
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9854470Z 2025-05-07T20:32:58.9854881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9854952Z 2025-05-07T20:32:58.9855089Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9855307Z self=, 2025-05-07T20:32:58.9855384Z T=2048, 2025-05-07T20:32:58.9855453Z D=7168, 2025-05-07T20:32:58.9855534Z scale_ub=1200.0, 2025-05-07T20:32:58.9855617Z contiguous=True, 2025-05-07T20:32:58.9855698Z compiled=False, 2025-05-07T20:32:58.9855777Z ) 2025-05-07T20:32:58.9855990Z self = 2025-05-07T20:32:58.9856162Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.9856166Z 2025-05-07T20:32:58.9856239Z @given( 2025-05-07T20:32:58.9856356Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9856452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9856577Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9856692Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9856811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9856882Z ) 2025-05-07T20:32:58.9857124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9857217Z def test_silu_mul_quant( 2025-05-07T20:32:58.9857286Z self, 2025-05-07T20:32:58.9857402Z T: int, 2025-05-07T20:32:58.9857481Z D: int, 2025-05-07T20:32:58.9857577Z scale_ub: Optional[float], 2025-05-07T20:32:58.9857668Z contiguous: bool, 2025-05-07T20:32:58.9857755Z compiled: bool, 2025-05-07T20:32:58.9857830Z ) -> None: 2025-05-07T20:32:58.9857919Z torch.manual_seed(2025) 2025-05-07T20:32:58.9857992Z 2025-05-07T20:32:58.9858155Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9860099Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.9860164Z 2025-05-07T20:32:58.9860322Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.9860328Z 2025-05-07T20:32:58.9860471Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9860778Z self=, 2025-05-07T20:32:58.9860875Z T=1, 2025-05-07T20:32:58.9860978Z D=5120, 2025-05-07T20:32:58.9861088Z scale_ub=1200.0, 2025-05-07T20:32:58.9861202Z contiguous=True, 2025-05-07T20:32:58.9861317Z compiled=False, 2025-05-07T20:32:58.9861413Z ) 2025-05-07T20:32:58.9861720Z self = 2025-05-07T20:32:58.9861947Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.9861954Z 2025-05-07T20:32:58.9862052Z @given( 2025-05-07T20:32:58.9862214Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9862348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9862505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9862665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9862822Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9862917Z ) 2025-05-07T20:32:58.9863184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9863279Z def test_silu_mul_quant( 2025-05-07T20:32:58.9863351Z self, 2025-05-07T20:32:58.9863475Z T: int, 2025-05-07T20:32:58.9863549Z D: int, 2025-05-07T20:32:58.9863642Z scale_ub: Optional[float], 2025-05-07T20:32:58.9863772Z contiguous: bool, 2025-05-07T20:32:58.9863860Z compiled: bool, 2025-05-07T20:32:58.9863936Z ) -> None: 2025-05-07T20:32:58.9864029Z torch.manual_seed(2025) 2025-05-07T20:32:58.9864099Z 2025-05-07T20:32:58.9864264Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9864336Z 2025-05-07T20:32:58.9864430Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9864552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9864640Z x = x_sign * x_clamp 2025-05-07T20:32:58.9864716Z x0 = x[:, :D] 2025-05-07T20:32:58.9864791Z x1 = x[:, D:] 2025-05-07T20:32:58.9864868Z 2025-05-07T20:32:58.9864946Z if contiguous: 2025-05-07T20:32:58.9865034Z x0 = x0.contiguous() 2025-05-07T20:32:58.9865122Z x1 = x1.contiguous() 2025-05-07T20:32:58.9865198Z 2025-05-07T20:32:58.9865288Z if scale_ub is not None: 2025-05-07T20:32:58.9865397Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9865528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9865602Z ) 2025-05-07T20:32:58.9865675Z else: 2025-05-07T20:32:58.9865767Z scale_ub_tensor = None 2025-05-07T20:32:58.9865843Z 2025-05-07T20:32:58.9866011Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9866102Z op = silu_mul_quant 2025-05-07T20:32:58.9866189Z if compiled: 2025-05-07T20:32:58.9866285Z op = torch.compile(op) 2025-05-07T20:32:58.9866388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9866460Z 2025-05-07T20:32:58.9866546Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9866550Z 2025-05-07T20:32:58.9866647Z moe/activation_test.py:117: 2025-05-07T20:32:58.9866776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9866872Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9866974Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9867466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9867561Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9867922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9868220Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9868582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9868675Z kernel = self.compile( 2025-05-07T20:32:58.9869052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9869227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9869356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9869361Z 2025-05-07T20:32:58.9869563Z self = 2025-05-07T20:32:58.9870341Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9870841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132d5e200>} 2025-05-07T20:32:58.9871585Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9871816Z context = 2025-05-07T20:32:58.9871858Z 2025-05-07T20:32:58.9872028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9872285Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9872395Z module_map=module_map) 2025-05-07T20:32:58.9872563Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9872662Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9872734Z E ^ 2025-05-07T20:32:58.9873088Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9873093Z 2025-05-07T20:32:58.9873502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9873510Z 2025-05-07T20:32:58.9873616Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9873838Z self=, 2025-05-07T20:32:58.9873913Z T=2048, 2025-05-07T20:32:58.9873991Z D=5120, 2025-05-07T20:32:58.9874067Z scale_ub=None, 2025-05-07T20:32:58.9874149Z contiguous=True, 2025-05-07T20:32:58.9874232Z compiled=False, 2025-05-07T20:32:58.9874302Z ) 2025-05-07T20:32:58.9874560Z self = 2025-05-07T20:32:58.9874735Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.9874739Z 2025-05-07T20:32:58.9874814Z @given( 2025-05-07T20:32:58.9874930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9875025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9875138Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9875256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9875368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9875440Z ) 2025-05-07T20:32:58.9875698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9875789Z def test_silu_mul_quant( 2025-05-07T20:32:58.9875862Z self, 2025-05-07T20:32:58.9875933Z T: int, 2025-05-07T20:32:58.9876003Z D: int, 2025-05-07T20:32:58.9876108Z scale_ub: Optional[float], 2025-05-07T20:32:58.9876194Z contiguous: bool, 2025-05-07T20:32:58.9876322Z compiled: bool, 2025-05-07T20:32:58.9876401Z ) -> None: 2025-05-07T20:32:58.9876490Z torch.manual_seed(2025) 2025-05-07T20:32:58.9876561Z 2025-05-07T20:32:58.9876727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9876797Z 2025-05-07T20:32:58.9876888Z > x_sign = torch.sign(x) 2025-05-07T20:32:58.9878717Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.9878727Z 2025-05-07T20:32:58.9878845Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:58.9878853Z 2025-05-07T20:32:58.9878953Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9879171Z self=, 2025-05-07T20:32:58.9879248Z T=16384, 2025-05-07T20:32:58.9879320Z D=5120, 2025-05-07T20:32:58.9879398Z scale_ub=None, 2025-05-07T20:32:58.9879480Z contiguous=True, 2025-05-07T20:32:58.9879605Z compiled=False, 2025-05-07T20:32:58.9879677Z ) 2025-05-07T20:32:58.9879930Z self = 2025-05-07T20:32:58.9880105Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.9880109Z 2025-05-07T20:32:58.9880191Z @given( 2025-05-07T20:32:58.9880304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9880399Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9880515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9880630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9880743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9880816Z ) 2025-05-07T20:32:58.9881058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9881150Z def test_silu_mul_quant( 2025-05-07T20:32:58.9881229Z self, 2025-05-07T20:32:58.9881303Z T: int, 2025-05-07T20:32:58.9881378Z D: int, 2025-05-07T20:32:58.9881477Z scale_ub: Optional[float], 2025-05-07T20:32:58.9881566Z contiguous: bool, 2025-05-07T20:32:58.9881651Z compiled: bool, 2025-05-07T20:32:58.9881726Z ) -> None: 2025-05-07T20:32:58.9881816Z torch.manual_seed(2025) 2025-05-07T20:32:58.9881886Z 2025-05-07T20:32:58.9882049Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9883856Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.9883873Z 2025-05-07T20:32:58.9883988Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.9883995Z 2025-05-07T20:32:58.9884093Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9884312Z self=, 2025-05-07T20:32:58.9884385Z T=4096, 2025-05-07T20:32:58.9884455Z D=5120, 2025-05-07T20:32:58.9884538Z scale_ub=None, 2025-05-07T20:32:58.9884620Z contiguous=True, 2025-05-07T20:32:58.9884769Z compiled=False, 2025-05-07T20:32:58.9884841Z ) 2025-05-07T20:32:58.9885050Z self = 2025-05-07T20:32:58.9885223Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.9885227Z 2025-05-07T20:32:58.9885302Z @given( 2025-05-07T20:32:58.9885414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9885514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9885627Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9885741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9885855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9885925Z ) 2025-05-07T20:32:58.9886171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9886261Z def test_silu_mul_quant( 2025-05-07T20:32:58.9886336Z self, 2025-05-07T20:32:58.9886418Z T: int, 2025-05-07T20:32:58.9886490Z D: int, 2025-05-07T20:32:58.9886584Z scale_ub: Optional[float], 2025-05-07T20:32:58.9886675Z contiguous: bool, 2025-05-07T20:32:58.9886755Z compiled: bool, 2025-05-07T20:32:58.9886828Z ) -> None: 2025-05-07T20:32:58.9886921Z torch.manual_seed(2025) 2025-05-07T20:32:58.9886989Z 2025-05-07T20:32:58.9887155Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9889004Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.9889013Z 2025-05-07T20:32:58.9889129Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.9889137Z 2025-05-07T20:32:58.9889234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9889448Z self=, 2025-05-07T20:32:58.9889524Z T=2048, 2025-05-07T20:32:58.9889592Z D=5120, 2025-05-07T20:32:58.9889670Z scale_ub=None, 2025-05-07T20:32:58.9889765Z contiguous=False, 2025-05-07T20:32:58.9890082Z compiled=False, 2025-05-07T20:32:58.9890194Z ) 2025-05-07T20:32:58.9890458Z self = 2025-05-07T20:32:58.9890628Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.9890632Z 2025-05-07T20:32:58.9890706Z @given( 2025-05-07T20:32:58.9890912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9891009Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9891130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9891241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9891352Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9891432Z ) 2025-05-07T20:32:58.9891678Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9891771Z def test_silu_mul_quant( 2025-05-07T20:32:58.9891851Z self, 2025-05-07T20:32:58.9891922Z T: int, 2025-05-07T20:32:58.9891994Z D: int, 2025-05-07T20:32:58.9892094Z scale_ub: Optional[float], 2025-05-07T20:32:58.9892179Z contiguous: bool, 2025-05-07T20:32:58.9892264Z compiled: bool, 2025-05-07T20:32:58.9892340Z ) -> None: 2025-05-07T20:32:58.9892430Z torch.manual_seed(2025) 2025-05-07T20:32:58.9892504Z 2025-05-07T20:32:58.9892670Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9894509Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
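The "Tried to allocate" sizes track the input tensor exactly: T x 2D bfloat16 elements at 2 bytes each, so each failure above is the very first allocation of its example, not hidden overhead. A quick arithmetic check (plain Python, helper name ours):

    def randn_mib(T: int, D: int) -> float:
        # The test allocates a [T, 2*D] bfloat16 tensor; bfloat16 is 2 bytes.
        return T * 2 * D * 2 / 2**20

    assert randn_mib(4096, 5120) == 80.0    # the 80.00 MiB failure above
    assert randn_mib(2048, 5120) == 40.0    # the 40.00 MiB failures
    assert randn_mib(4096, 7168) == 112.0   # the 112.00 MiB failures
    assert randn_mib(16384, 7168) == 448.0  # the 448.00 MiB failures below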
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
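Every example above fails with only about 26 MiB free on a 22 GiB device, so the shortage is accumulated state across Hypothesis examples rather than any single allocation being too large. A hedged sketch of one mitigation, releasing cached memory between examples (the helper name is ours, not from the test file):

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop Python references left over from prior examples
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver

Calling such a helper at the top of the test body would keep one example's tensors from starving the next.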
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3133089ea0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
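This CompilationError is an architecture limit rather than a memory problem: Triton's fp8e4nv type (float8_e4m3fn) is, to our knowledge, only available on SM 8.9+ GPUs, while this job's g5 runner carries an A10G (SM 8.6) that exposes only fp8e4b15 and fp8e5, exactly as the ValueError lists. A hedged sketch of a capability guard for skipping such cases (the helper name is ours):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv maps to float8_e4m3fn, available on Ada (SM 8.9) and
        # Hopper (SM 9.0) but not on the A10G (SM 8.6) in this job.
        return torch.cuda.get_device_capability() >= (8, 9)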
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
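By this point even 20 MiB allocations fail with 21.77 GiB already held by PyTorch, so each new example dies at whichever line allocates first. A guard that fails fast with a more diagnostic message, sketched around torch.cuda.mem_get_info (which returns free and total bytes for the current device; the helper name and threshold are ours):

    import torch

    def require_free_cuda(min_free: int = 1 << 30) -> None:
        free, total = torch.cuda.mem_get_info()
        if free < min_free:
            raise RuntimeError(
                f"only {free / 2**20:.0f} MiB of {total / 2**30:.2f} GiB free; "
                "earlier examples are likely still holding allocations"
            )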
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
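The deprecation warning refers to keyword arguments of triton.autotune. A sketch of the decorator written without the deprecated warmup/rep/use_cuda_graph knobs (the kernel and its config values are illustrative only, not FBGEMM's):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK": 128}, num_warps=4),
            triton.Config({"BLOCK": 256}, num_warps=8),
        ],
        key=["n"],  # retune when the problem size changes
    )
    @triton.jit
    def _double_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)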
See " 2025-05-07T20:32:58.9985778Z 2025-05-07T20:32:58.9985995Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:58.9986201Z ================= 1 failed, 1 deselected, 3 warnings in 17.41s ================= 2025-05-07T20:33:00.5501814Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:00.6128138Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:00.6128404Z 2025-05-07T20:33:00.6128575Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:00.6129145Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:00.6129573Z 2025-05-07T20:33:00.6129578Z 2025-05-07T20:33:00.6129582Z 2025-05-07T20:33:00.6145500Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:00.6225477Z Post job cleanup. 2025-05-07T20:33:00.7218628Z [command]/usr/bin/git version 2025-05-07T20:33:00.7263394Z git version 2.47.1 2025-05-07T20:33:00.7302206Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/0dcfceed-1031-4d6a-9b2c-6229e635b8d3/.gitconfig' 2025-05-07T20:33:00.7313022Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/0dcfceed-1031-4d6a-9b2c-6229e635b8d3' before making global git config changes 2025-05-07T20:33:00.7313890Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:00.7318432Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:00.7370502Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:00.7405322Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:00.7745575Z Entering 'external/asmjit' 2025-05-07T20:33:00.7812472Z Entering 'external/composable_kernel' 2025-05-07T20:33:00.7885813Z Entering 'external/cpuinfo' 2025-05-07T20:33:00.7953612Z Entering 'external/cutlass' 2025-05-07T20:33:00.8029624Z Entering 'external/googletest' 2025-05-07T20:33:00.8096246Z Entering 'external/hipify_torch' 2025-05-07T20:33:00.8164975Z Entering 'external/json' 2025-05-07T20:33:00.8251299Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:00.8274094Z http.https://github.com/.extraheader 2025-05-07T20:33:00.8284436Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:00.8315235Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:00.8651423Z Entering 'external/asmjit' 2025-05-07T20:33:00.8694138Z http.https://github.com/.extraheader 2025-05-07T20:33:00.8737922Z Entering 'external/composable_kernel' 2025-05-07T20:33:00.8780088Z http.https://github.com/.extraheader 2025-05-07T20:33:00.8829554Z Entering 'external/cpuinfo' 2025-05-07T20:33:00.8871978Z http.https://github.com/.extraheader 2025-05-07T20:33:00.8915512Z Entering 'external/cutlass' 2025-05-07T20:33:00.8957470Z http.https://github.com/.extraheader 2025-05-07T20:33:00.9008639Z 
2025-05-07T20:33:00.6225477Z Post job cleanup.
2025-05-07T20:33:00.7218628Z [command]/usr/bin/git version
2025-05-07T20:33:00.7263394Z git version 2.47.1
2025-05-07T20:33:00.7302206Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/0dcfceed-1031-4d6a-9b2c-6229e635b8d3/.gitconfig'
2025-05-07T20:33:00.7313022Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/0dcfceed-1031-4d6a-9b2c-6229e635b8d3' before making global git config changes
2025-05-07T20:33:00.7313890Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:33:00.7318432Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:33:00.7370502Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:33:00.7405322Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:33:00.7745575Z Entering 'external/asmjit'
2025-05-07T20:33:00.7812472Z Entering 'external/composable_kernel'
2025-05-07T20:33:00.7885813Z Entering 'external/cpuinfo'
2025-05-07T20:33:00.7953612Z Entering 'external/cutlass'
2025-05-07T20:33:00.8029624Z Entering 'external/googletest'
2025-05-07T20:33:00.8096246Z Entering 'external/hipify_torch'
2025-05-07T20:33:00.8164975Z Entering 'external/json'
2025-05-07T20:33:00.8251299Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:33:00.8274094Z http.https://github.com/.extraheader
2025-05-07T20:33:00.8284436Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2025-05-07T20:33:00.8315235Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:33:00.8651423Z Entering 'external/asmjit'
2025-05-07T20:33:00.8694138Z http.https://github.com/.extraheader
2025-05-07T20:33:00.8737922Z Entering 'external/composable_kernel'
2025-05-07T20:33:00.8780088Z http.https://github.com/.extraheader
2025-05-07T20:33:00.8829554Z Entering 'external/cpuinfo'
2025-05-07T20:33:00.8871978Z http.https://github.com/.extraheader
2025-05-07T20:33:00.8915512Z Entering 'external/cutlass'
2025-05-07T20:33:00.8957470Z http.https://github.com/.extraheader
2025-05-07T20:33:00.9008639Z Entering 'external/googletest'
2025-05-07T20:33:00.9050590Z http.https://github.com/.extraheader
2025-05-07T20:33:00.9092930Z Entering 'external/hipify_torch'
2025-05-07T20:33:00.9135263Z http.https://github.com/.extraheader
2025-05-07T20:33:00.9176997Z Entering 'external/json'
2025-05-07T20:33:00.9220652Z http.https://github.com/.extraheader
2025-05-07T20:33:00.9369312Z A job completed hook has been configured by the self-hosted runner administrator
2025-05-07T20:33:00.9404581Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
2025-05-07T20:33:00.9415033Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:33:00.9415393Z ##[endgroup]
2025-05-07T20:33:00.9514629Z [!ALERT!] Swap in detected! [!ALERT!]
2025-05-07T20:33:11.7054356Z [!ALERT!] Swap out detected [!ALERT!]
2025-05-07T20:33:28.0591910Z Cleaning up orphan processes